add catastrophe recovery for cassandra #1579
Conversation
I think it's not very nice that …
I somewhat agree, but I think this would add a level of complexity that we don't want to deal with. It's not only about acquiring a valid session, but also about holding that read lock until all of the iterating on the query is done. This isn't a problem when executing statements, but when dealing with Iterators or Scanners it can be, especially if paging is involved, because more queries may be executed as we iterate. There is a case where a function would acquire a valid session, but that session could then become invalid or be replaced by a completely new session while queries are being resubmitted on an iterator. I don't know whether that would cause problems, but it may. I will look into that more.
I don't think it would be necessary to hold the read lock until all of the iterating is done on the query. If the old session dies and gets replaced in the store/idx, then all the subsequent queries on already instantiated iterators would fail anyway because the session is a member of the iterator and it won't be replaced inside the iterator. If we close the dead connection while some iterator is trying to use it, then it would only fail faster, which is good.
I think once a session dies, all subsequent queries of existing iterators which are using that session are going to fail in any case, no matter what we do with the session property in the cassandra idx/store. What matters is only that, once we have replaced the dead connection with a working one, all the new queries created from that point on will succeed.
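A minimal sketch of the pattern being described here, for illustration only (the type name, field names, and the recoverSession helper are made up and not taken from the Metrictank code; sessionLock matches the name used later in this discussion):

```go
package store

import (
	"sync"

	"github.com/gocql/gocql"
)

// cassandraStore is a hypothetical stand-in for the cassandra idx/store.
type cassandraStore struct {
	sessionLock sync.RWMutex
	cluster     *gocql.ClusterConfig
	session     *gocql.Session
}

// recoverSession replaces a dead session under the write lock. Iterators that
// were created from the old session keep referencing it and will simply fail;
// only queries created after this point get the new session.
func (c *cassandraStore) recoverSession() error {
	c.sessionLock.Lock()
	defer c.sessionLock.Unlock()

	session, err := c.cluster.CreateSession()
	if err != nil {
		return err
	}
	if c.session != nil {
		c.session.Close()
	}
	c.session = session
	return nil
}
```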
Why do we need to hold sessionLock for the duration of the query? Couldn't we simplify this with something like:
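(The original snippet from this comment is not preserved above; the following is only a sketch of the kind of simplification being suggested, reusing the hypothetical cassandraStore from the earlier sketch, with a made-up table and query.)

```go
// Sketch only: copy the current session pointer under the read lock, release
// the lock immediately, and run the whole query on the copied pointer. The
// lock then only guards the field access, not the duration of the query.
func (c *cassandraStore) search(key string) ([]string, error) {
	c.sessionLock.RLock()
	session := c.session
	c.sessionLock.RUnlock()

	iter := session.Query(`SELECT id FROM metric_idx WHERE key = ?`, key).Iter()
	var (
		id  string
		ids []string
	)
	for iter.Scan(&id) {
		ids = append(ids, id)
	}
	// If the session died mid-iteration this returns the error; by then the
	// store may already hold a fresh session for the retry.
	return ids, iter.Close()
}
```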
That might work for specific functions, but not all of them. When we use …

Let's say that the above never happens because of how we are actually chaining these functions. The other scenario where this is a problem is during …

Part of the reason we wanted to update gocql in the first place was so that we could start using some of the batch functions. I haven't looked at those yet, but I think a similar problem may exist. There is a bit of asynchronous work being done inside the library itself.
This wouldn't happen. c.Session is a *gocql.Session, so the value of c.Session is just a pointer.
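A tiny self-contained illustration of that point (none of these names are from Metrictank): copying a pointer-typed field into a local variable means a later reassignment of the field has no effect on the copy.

```go
package main

import "fmt"

type conn struct{ id int }

type store struct{ Session *conn }

func main() {
	s := &store{Session: &conn{id: 1}}

	local := s.Session       // copy the pointer value
	s.Session = &conn{id: 2} // replace the field afterwards

	fmt.Println(local.id)     // still 1: the copy keeps pointing at the old conn
	fmt.Println(s.Session.id) // 2: only new reads of the field see the new conn
}
```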
After reading over the code more I agree with you. I will make updates to this PR to simplify it so we don't need to hold locks open the entire time. Not sure what I was thinking with that line of reasoning exactly. |
LGTM, just address @fitzoh's last comment.
This PR adds logic to reconnect to a cassandra cluster in case of catastrophe. The only known case where this happens is when all of the cassandra nodes go down at around the same time and all of their IP addresses change. When this happened previously, Metrictank had no way of reconnecting to the new cassandra cluster without restarting the Metrictank processes. In small deployments this might not be a big deal, but for larger deployments that take a long time to transition to a ready state this is not a good option.
After we updated gocql (it was about 2 years out of date for us), I saw very inconsistent results while reproducing a catastrophic failure: sometimes it would reconnect without issues, other times it would not.
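For illustration, one plausible shape of such reconnect logic, again reusing the hypothetical cassandraStore sketched in the conversation above (this is an assumption about the approach, not necessarily what this PR implements; real code would want to distinguish transient errors from total loss of the cluster and rate-limit the rebuild):

```go
// Sketch only: if a query fails, try to rebuild the session once and retry
// the query on the new one.
func (c *cassandraStore) execWithRecovery(query string, args ...interface{}) error {
	c.sessionLock.RLock()
	session := c.session
	c.sessionLock.RUnlock()

	err := session.Query(query, args...).Exec()
	if err == nil {
		return nil
	}

	// The failure may mean the whole cluster moved (e.g. all node IPs
	// changed), in which case only a brand new session can help.
	if rerr := c.recoverSession(); rerr != nil {
		return err
	}

	c.sessionLock.RLock()
	session = c.session
	c.sessionLock.RUnlock()
	return session.Query(query, args...).Exec()
}
```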
Fixes: #1566
See also: apache/cassandra-gocql-driver/issues/831