This repository was archived by the owner on Aug 23, 2023. It is now read-only.

add catastrophe recovery for cassandra #1579

Merged
10 commits merged into master on Jan 17, 2020

Conversation

robert-milan
Contributor

@robert-milan robert-milan commented Dec 20, 2019

This PR adds logic to reconnect to a cassandra cluster in case of catastrophe. The only known case where this happens is when all of the cassandra nodes go down at a similar time and all of their IP addresses change. When this happened previously, Metrictank had no way of reconnecting to the new cassandra cluster without restarting Metrictank processes. In small deployments this might not be a big deal, but for larger deployments that take a long time to transition to a ready state this is not a good option.

After we updated gocql (it was about 2 years out of date for us) I saw very inconsistent results while reproducing a catastrophic failure. Sometimes it would reconnect without issues, other times it would not.

Fixes: #1566
See also: apache/cassandra-gocql-driver/issues/831

@robert-milan robert-milan force-pushed the recover-from-cassandra-catastrophe branch from dc42b90 to 145d324 Compare December 25, 2019 09:49
@robert-milan robert-milan changed the title WIP: add catastrophe recovery for cassandra add catastrophe recovery for cassandra Dec 25, 2019
@robert-milan robert-milan requested a review from replay December 25, 2019 16:03
@replay
Contributor

replay commented Dec 26, 2019

The .Session property in both the store and the index is a pointer. If we detect that the current connection is down, wouldn't it be way easier to first instantiate a new one, and once this is successful use atomic.SwapPointer() to replace the .Session property? Obviously every location which accesses this session will need to use atomic.LoadPointer() to access it, but this could be wrapped in some .getSession() method. Then we would not have to add another lock.

@replay
Contributor

replay commented Dec 26, 2019

I think it's not very nice that deadConnectionCheck() is basically duplicated (when ignoring the log statements). It would be nicer to find a way to de-duplicate that. For example, one possibility would be to create a new struct which encapsulates the Cassandra connection and has a .GetSession() method; this struct could then handle the dead connection check and the reconnecting. This would also make it easier to unit test the dead connection check / reconnection logic.

@robert-milan
Contributor Author

I think it's not very nice that deadConnectionCheck() is basically duplicated (when ignoring the log statements). It would be nicer to find a way to de-duplicate that. For example, one possibility would be to create a new struct which encapsulates the Cassandra connection and has a .GetSession() method; this struct could then handle the dead connection check and the reconnecting. This would also make it easier to unit test the dead connection check / reconnection logic.

I somewhat agree, but I think this would add a level of complexity that we don't want to deal with. It's not only about acquiring a valid session, but also holding that read lock until all of the iterating is done on the query. This isn't a problem when executing statements, but when dealing with Iterators or Scanners it can be a problem, especially if paging is involved because it may execute more queries as we iterate over it. There is a case where a function would acquire a valid session, but then that session could become invalid or a completely new session while it is resubmitting queries on an iterator. I don't know if that would cause problems or not, but it may. I will look into that more.

@replay
Contributor

replay commented Jan 7, 2020

I somewhat agree, but I think this would add a level of complexity that we don't want to deal with. It's not only about acquiring a valid session, but also holding that read lock until all of the iterating is done on the query.

I don't think it would be necessary to hold the read lock until all of the iterating is done on the query. If the old session dies and gets replaced in the store/idx, then all the subsequent queries on already instantiated iterators would fail anyway because the session is a member of the iterator and it won't be replaced inside the iterator. If we close the dead connection while some iterator is trying to use it, then it would only fail faster, which is good.

This isn't a problem when executing statements, but when dealing with Iterators or Scanners it can be a problem, especially if paging is involved because it may execute more queries as we iterate over it. There is a case where a function would acquire a valid session, but then that session could become invalid or a completely new session while it is resubmitting queries on an iterator. I don't know if that would cause problems or not, but it may. I will look into that more.

I think once a session dies all the subsequent queries of existing iterators which are using this session are going to fail in any case, no matter what we do with the session property in the cassandra idx/store. Important is only that once we replaced the dead connection with a working one all the new queries which get created from that point on will succeed.

@woodsaj
Member

woodsaj commented Jan 7, 2020

Why do we need to hold sessionLock for the duration of the query? Couldn't we simplify this with something like:

func (c *CassandraStore) CurrentSession() *gocql.Session {
  var session *gocql.Session
  c.sessionLock.RLock()
  session = c.Session
  c.sessionLock.RUnlock()
  return session
}

func (c *CassandraStore) FindExistingTables(keyspace string) error {
  session := c.CurrentSession()
  meta, err := session.KeyspaceMetadata(keyspace)
  ....

@robert-milan
Contributor Author

robert-milan commented Jan 8, 2020

Why do we need to hold sessionLock for the duration of the query? Couldn't we simplify this with something like:

func (c *CassandraStore) CurrentSession() *gocql.Session {
  var session *gocql.Session
  c.sessionLock.RLock()
  session = c.Session
  c.sessionLock.RUnlock()
  return session
}

func (c *CassandraStore) FindExistingTables(keyspace string) error {
  session := c.CurrentSession()
  meta, err := session.KeyspaceMetadata(keyspace)
  ....

That might work for specific functions, but not all of them. When we use gocql.Session.Query and then further Iter(), the *Query that is returned also has a pointer to the session as a member, which it uses in session.executeQuery to get the *Iter. If we don't keep the lock for the duration of that call then there is a high likelihood of encountering undefined behavior or a SEGFAULT, because the old *Session may have been replaced, re-used by other parts of the process, or set to nil. There is also a high chance of this happening because the session was closed / destroyed after the call to CurrentSession() but before FindExistingTables was able to execute KeyspaceMetadata.

Let's say that the above never happens because of how we are actually chaining these functions. The other scenario where this is a problem is during *Iter.Scan(). This can also attempt to access a *session from the original *query, which could have changed or become nil. As far as I can tell this only happens when paging is enabled and there are multiple pages of data available.

Part of the reason we wanted to update gocql in the first place was so that we could start using some of the batch functions. I haven't looked at those yet, but I think a similar problem may exist. There is a bit of asynchronous work being done inside the library itself.

@woodsaj
Member

woodsaj commented Jan 8, 2020

If we don't keep the lock for the duration of that call then there is a high likelihood of encountering undefined behavior or a SEGFAULT because the old *session may have been replaced,

This wouldn't happen. c.Session is a *gocql.Session. So the value of c.Session is just a pointer.
If you create a new session it gets a new pointer address. When you assign the new session to c.Session, it just changes what c.Session points to. Anyone who already has a copy of the pointer address to the old *gocql.Session can keep using it.

@robert-milan
Contributor Author

If we don't keep the lock for the duration of that call then there is a high likelihood of encountering undefined behavior or a SEGFAULT because the old *session may have been replaced,

This wouldn't happen. c.Session is a *gocql.Session. So the value of c.Session is just a pointer.
If you create a new session it gets a new pointer address. When you assign the new session to c.Session, it just changes what c.Session points to. Anyone who already has a copy of the pointer address to the old *gocql.Session can keep using it.

After reading over the code more I agree with you. I will make updates to this PR to simplify it so we don't need to hold locks open the entire time. Not sure what I was thinking with that line of reasoning exactly.

@robert-milan robert-milan force-pushed the recover-from-cassandra-catastrophe branch from 845ba3b to 17d2658 Compare January 14, 2020 20:02
@robert-milan robert-milan force-pushed the recover-from-cassandra-catastrophe branch from 17d2658 to 4fb852f Compare January 14, 2020 20:19
@robert-milan robert-milan requested a review from replay January 14, 2020 20:25
Contributor

@replay replay left a comment


LGTM, just address @fitzoh's last comment.

@robert-milan robert-milan merged commit 045e2f9 into master Jan 17, 2020
@robert-milan robert-milan deleted the recover-from-cassandra-catastrophe branch January 17, 2020 00:11
Successfully merging this pull request may close these issues:

Metrictank does not rediscover cassandra node IPs correctly