This repository was archived by the owner on Aug 23, 2023. It is now read-only.

add catastrophe recovery for cassandra #1579

Merged
10 commits merged into master on Jan 17, 2020

Conversation

robert-milan
Contributor

@robert-milan robert-milan commented Dec 20, 2019

This PR adds logic to reconnect to a cassandra cluster in case of catastrophe. The only known case where this happens is when all of the cassandra nodes go down at a similar time and all of their IP addresses change. When this happened previously, Metrictank had no way of reconnecting to the new cassandra cluster without restarting Metrictank processes. In small deployments this might not be a big deal, but for larger deployments that take a long time to transition to a ready state this is not a good option.

After we updated gocql (it was about 2 years out of date for us) I saw very inconsistent results while reproducing a catastrophic failure. Sometimes it would reconnect without issues, other times it would not.

Fixes: #1566
See also: apache/cassandra-gocql-driver/issues/831

@robert-milan robert-milan force-pushed the recover-from-cassandra-catastrophe branch from dc42b90 to 145d324 Compare December 25, 2019 09:49
@robert-milan robert-milan changed the title WIP: add catastrophe recovery for cassandra add catastrophe recovery for cassandra Dec 25, 2019
@robert-milan robert-milan requested a review from replay December 25, 2019 16:03
@replay
Contributor

replay commented Dec 26, 2019

The .Session property in both the store and the index is a pointer. If we detect that the current connection is down, wouldn't it be way easier to first instantiate a new one, and once this is successful use atomic.SwapPointer() to replace the .Session property? Obviously every location which accesses this session will need to use atomic.LoadPointer() to access it, but this could be wrapped in some .getSession() method. Then we would not have to add another lock.

@replay
Contributor

replay commented Dec 26, 2019

I think it's not very nice that deadConnectionCheck() is basically duplicated (when ignoring the log statements). It would be nicer to find a way to de-duplicate that. For example, one possibility would be to create a new struct which encapsulates the Cassandra connection and has a .GetSession() method; this struct could then handle the dead connection check and the reconnecting. This would also make it easier to unit test the dead connection check / reconnection logic.

@robert-milan
Contributor Author

I think it's not very nice that deadConnectionCheck() is basically duplicated (when ignoring the log statements). It would be nicer to find a way to de-duplicate that. For example, one possibility would be to create a new struct which encapsulates the Cassandra connection and has a .GetSession() method; this struct could then handle the dead connection check and the reconnecting. This would also make it easier to unit test the dead connection check / reconnection logic.

I somewhat agree, but I think this would add a level of complexity that we don't want to deal with. It's not only about acquiring a valid session, but also holding that read lock until all of the iterating is done on the query. This isn't a problem when executing statements, but when dealing with Iterators or Scanners it can be a problem, especially if paging is involved because it may execute more queries as we iterate over it. There is a case where a function would acquire a valid session, but then that session could become invalid or a completely new session while it is resubmitting queries on an iterator. I don't know if that would cause problems or not, but it may. I will look into that more.

@replay
Contributor

replay commented Jan 7, 2020

I somewhat agree, but I think this would add a level of complexity that we don't want to deal with. It's not only about acquiring a valid session, but also holding that read lock until all of the iterating is done on the query.

I don't think it would be necessary to hold the read lock until all of the iterating is done on the query. If the old session dies and gets replaced in the store/idx, then all the subsequent queries on already instantiated iterators would fail anyway because the session is a member of the iterator and it won't be replaced inside the iterator. If we close the dead connection while some iterator is trying to use it, then it would only fail faster, which is good.

This isn't a problem when executing statements, but when dealing with Iterators or Scanners it can be a problem, especially if paging is involved because it may execute more queries as we iterate over it. There is a case where a function would acquire a valid session, but then that session could become invalid or a completely new session while it is resubmitting queries on an iterator. I don't know if that would cause problems or not, but it may. I will look into that more.

I think once a session dies all the subsequent queries of existing iterators which are using this session are going to fail in any case, no matter what we do with the session property in the cassandra idx/store. Important is only that once we replaced the dead connection with a working one all the new queries which get created from that point on will succeed.

@woodsaj
Member

woodsaj commented Jan 7, 2020

Why do we need to hold sessionLock for the duration of the query? Couldn't we simplify this with something like:

func (c *CassandraStore) CurrentSession() *gocql.Session {
  var session *gocql.Session
  c.sessionLock.RLock()
  session = c.Session
  c.sessionLock.RUnlock()
  return session
}

func (c *CassandraStore) FindExistingTables(keyspace string) error {
  session := c.CurrentSession()
  meta, err := session.KeyspaceMetadata(keyspace)
  ....

@robert-milan
Contributor Author

robert-milan commented Jan 8, 2020

Why do we need to hold sessionLock for the duration of the query? Couldn't we simplify this with something like:

func (c *CassandraStore) CurrentSession() *gocql.Session {
  var session *gocql.Session
  c.sessionLock.RLock()
  session = c.Session
  c.sessionLock.RUnlock()
  return session
}

func (c *CassandraStore) FindExistingTables(keyspace string) error {
  session := c.CurrentSession()
  meta, err := session.KeyspaceMetadata(keyspace)
  ....

That might work for specific functions, but not all of them. When we use gocql.Session.Query and then further Iter(), the *Query that is returned also has a pointer to the session as a member, which it uses in session.executeQuery to get the *Iter. If we don't keep the lock for the duration of that call then there is a high likelihood of encountering undefined behavior or a SEGFAULT, because the old *Session may have been replaced, re-used by other parts of the process, or set to nil. There is also a high chance of this happening because the session was closed / destroyed after the call to CurrentSession() but before FindExistingTables was able to execute KeyspaceMetadata.

Let's say that the above never happens because of how we are actually chaining these functions. The other scenario where this is a problem is during *Iter.Scan(). This can also attempt to access a *session from the original *query, which could have changed or become nil. As far as I can tell this only happens when paging is enabled and there are multiple pages of data available.

Part of the reason we wanted to update gocql in the first place was so that we could start using some of the batch functions. I haven't looked at those yet, but I think a similar problem may exist. There is a bit of asynchronous work being done inside the library itself.

@woodsaj
Member

woodsaj commented Jan 8, 2020

If we don't keep the lock for the duration of that call then there is a high likelihood of encountering undefined behavior or a SEGFAULT because the old *session may have been replaced,

This wouldn't happen. c.Session is a *gocql.Session. So the value of c.Session is just a pointer.
If you create a new session it gets a new pointer address. When you assign the new session to c.Session, it just changes what c.Session points to. Anyone who already has a copy of the pointer address to the old *gocql.Session can keep using it.

@robert-milan
Contributor Author

If we don't keep the lock for the duration of that call then there is a high likelihood of encountering undefined behavior or a SEGFAULT because the old *session may have been replaced,

This wouldn't happen. c.Session is a *gocql.Session. So the value of c.Session is just a pointer.
If you create a new session it gets a new pointer address. When you assign the new session to c.Session, it just changes what c.Session points to. Anyone who already has a copy of the pointer address to the old *gocql.Session can keep using it.

After reading over the code more I agree with you. I will make updates to this PR to simplify it so we don't need to hold locks open the entire time. Not sure what I was thinking with that line of reasoning exactly.

@robert-milan robert-milan force-pushed the recover-from-cassandra-catastrophe branch from 845ba3b to 17d2658 Compare January 14, 2020 20:02
@robert-milan robert-milan force-pushed the recover-from-cassandra-catastrophe branch from 17d2658 to 4fb852f Compare January 14, 2020 20:19
@robert-milan robert-milan requested a review from replay January 14, 2020 20:25
Contributor

@replay replay left a comment


LGTM, just address @fitzoh's last comment.

@robert-milan robert-milan merged commit 045e2f9 into master Jan 17, 2020
@robert-milan robert-milan deleted the recover-from-cassandra-catastrophe branch January 17, 2020 00:11
Successfully merging this pull request may close these issues:

Metrictank does not rediscover cassandra node IPs correctly