We have a few bugs that manifest when doing a rolling upgrade of storage-schemas.conf across a cluster. During the upgrade, different nodes have different storage-schemas.conf rules, so the same schemaId value means different things on different nodes. Concretely (a sketch illustrating the mismatch follows this list):

1. Instances receiving queries use schemaId to look up retention and resolution in alignRequests. Hence "our last rollout added some new retentions and queries were messed up until the rollout completed (at which point everything was ok)" per @shanson7. (Note: this can be worked around by doing a blue/green deployment, so we could choose to treat it as merely a documentation bug.)
2. Write nodes will panic if they receive a chunk persist message for a span/rollup they don't recognize, which may happen if you change storage-schemas.conf or storage-aggregations.conf.
3. Not an issue yet, but if we ever add spec-exec for render requests (or a failover mechanism that retries a failed render on the other replica), schemaId may be off (a similar note as in 1 applies here).

TODO: track IrId and AggId through the source code and check whether they have similar issues.
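For illustration, a minimal Go sketch (types and names here are hypothetical, not the project's actual ones) of why an index-based schemaId breaks down when two nodes have parsed different storage-schemas.conf files:

```go
package main

import "fmt"

// Retention is a hypothetical, simplified stand-in for one rule parsed from
// storage-schemas.conf: at what resolution data is stored and for how long.
type Retention struct {
	Name        string
	IntervalSec int
	TTLSec      int
}

// lookupRetention mimics a node resolving a schemaId it received from a peer
// by indexing into its own locally parsed schema list.
func lookupRetention(local []Retention, schemaId int) (Retention, error) {
	if schemaId < 0 || schemaId >= len(local) {
		// On a real write node this is roughly the situation that surfaces as a
		// panic when a persist message references a span/rollup it doesn't know.
		return Retention{}, fmt.Errorf("unknown schemaId %d (only %d schemas loaded)", schemaId, len(local))
	}
	return local[schemaId], nil
}

func main() {
	// Config on a node that has not been upgraded yet.
	oldSchemas := []Retention{
		{Name: "default", IntervalSec: 10, TTLSec: 86400},
	}
	// Config on an upgraded node: a new rule was added above the old one,
	// shifting every id by one.
	newSchemas := []Retention{
		{Name: "high-res", IntervalSec: 1, TTLSec: 3600},
		{Name: "default", IntervalSec: 10, TTLSec: 86400},
	}

	// The upgraded node refers to "default" as schemaId 1...
	schemaId := 1
	fmt.Println("id 1 on the upgraded node:", newSchemas[schemaId].Name)

	// ...but the old node either resolves that id to a different rule or,
	// as here, doesn't have it at all.
	if _, err := lookupRetention(oldSchemas, schemaId); err != nil {
		fmt.Println("id 1 on the old node:", err)
	}
}
```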
dieter [7:12 PM]
To address the alignRequests one, each node could return its version of storage-schemas.conf, and alignRequests could theoretically work with that, but it sounds like more hassle than it's worth. I'm inclined to say: if you need to make changes to storage-schemas.conf, use a blue/green deployment, or switch traffic over to the other cluster, which is pretty much the same thing.
[7:12 PM]
The write node issue seems pretty critical, though.
[7:14 PM]
I think we should also add a section to the operations guide about when to do a blue/green-style deploy vs. an in-place upgrade.
dieter [7:20 PM]
The main downside of blue/green (or running a 2nd cluster) is that you need to at least temporarily double your read instances, or your entire cluster, respectively, which realistically isn't something we plan to solve any time soon.
So as long as we know there will be scenarios in which this type of upgrade is needed anyway (e.g. major clustering changes), there's not much to be gained by avoiding it for scenarios like this one.
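For reference, a minimal sketch of the first idea raised in the comments (each node returning its own schema definitions instead of a bare id); all names here are hypothetical, not the real response types:

```go
package main

import "fmt"

// Retention and SeriesMeta are hypothetical types. The idea: a node returns
// the retention definition itself alongside the data, so the querying node
// never has to index into its own, possibly different, storage-schemas.conf.
type Retention struct {
	Name        string
	IntervalSec int
	TTLSec      int
}

type SeriesMeta struct {
	Target    string
	Retention Retention // self-describing: no cross-node schemaId lookup
}

// alignInterval picks the coarsest native interval across the series, roughly
// what an alignRequests-style step would do with this self-describing metadata.
func alignInterval(series []SeriesMeta) int {
	max := 0
	for _, s := range series {
		if s.Retention.IntervalSec > max {
			max = s.Retention.IntervalSec
		}
	}
	return max
}

func main() {
	series := []SeriesMeta{
		{Target: "a", Retention: Retention{Name: "high-res", IntervalSec: 1, TTLSec: 3600}},
		{Target: "b", Retention: Retention{Name: "default", IntervalSec: 10, TTLSec: 86400}},
	}
	fmt.Println("aligned interval:", alignInterval(series), "s")
}
```

Whether the extra payload is worth it, versus just documenting the blue/green requirement, is exactly the trade-off discussed above.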