We have a few bugs that manifest when doing a rolling upgrade of storage-schemas.conf across a cluster. During the upgrade, different nodes have different storage-schemas.conf rules, so the same schemaId value means different things on different nodes. Concretely (a sketch illustrating the mismatch follows this list):

1. Instances receiving queries use schemaId to look up retention and resolution in alignRequests. Hence "our last rollout added some new retentions and queries were messed up until the rollout completed (at which point everything was ok)" per @shanson7. (Note: this can be worked around by doing a blue/green deployment, so we could choose to treat it as merely a documentation bug.)
2. Write nodes will panic if they receive a chunk persist message for a span/rollup they don't recognize, which may happen if you change storage-schemas.conf or storage-aggregations.conf.
3. Not an issue yet, but if we ever add spec-exec for render requests (or a failover mechanism that retries a failed render on the other replica), schemaId may be off (a similar note as in 1 applies here).

TODO: track IrId and AggId through the source code and check whether they have similar issues.
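For illustration, a minimal Go sketch (types and names here are hypothetical, not the project's actual ones) of why an index-based schemaId breaks down when two nodes have parsed different storage-schemas.conf files:

```go
package main

import "fmt"

// Retention is a hypothetical, simplified stand-in for one rule parsed from
// storage-schemas.conf: at what resolution data is stored and for how long.
type Retention struct {
	Name        string
	IntervalSec int
	TTLSec      int
}

// lookupRetention mimics a node resolving a schemaId it received from a peer
// by indexing into its own locally parsed schema list.
func lookupRetention(local []Retention, schemaId int) (Retention, error) {
	if schemaId < 0 || schemaId >= len(local) {
		// On a real write node this is roughly the situation that surfaces as a
		// panic when a persist message references a span/rollup it doesn't know.
		return Retention{}, fmt.Errorf("unknown schemaId %d (only %d schemas loaded)", schemaId, len(local))
	}
	return local[schemaId], nil
}

func main() {
	// Config on a node that has not been upgraded yet.
	oldSchemas := []Retention{
		{Name: "default", IntervalSec: 10, TTLSec: 86400},
	}
	// Config on an upgraded node: a new rule was added above the old one,
	// shifting every id by one.
	newSchemas := []Retention{
		{Name: "high-res", IntervalSec: 1, TTLSec: 3600},
		{Name: "default", IntervalSec: 10, TTLSec: 86400},
	}

	// The upgraded node refers to "default" as schemaId 1...
	schemaId := 1
	fmt.Println("id 1 on the upgraded node:", newSchemas[schemaId].Name)

	// ...but the old node either resolves that id to a different rule or,
	// as here, doesn't have it at all.
	if _, err := lookupRetention(oldSchemas, schemaId); err != nil {
		fmt.Println("id 1 on the old node:", err)
	}
}
```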
dieter [7:12 PM]
To address the alignRequests one, each node could return its version of storage-schemas.conf, and alignRequests could theoretically work with that, but it sounds like more hassle than it's worth. I'm inclined to say: if you need to make changes to storage-schemas.conf, use a blue/green deployment, or switch traffic over to the other cluster, which is pretty much the same thing.
[7:12 PM]
The write node issue seems pretty critical, though.
[7:14 PM]
I think we should also add a section to the operations guide about when to do a blue/green-style deploy vs. an in-place upgrade.
dieter [7:20 PM]
The main downside of blue/green (or running a 2nd cluster) is that you need to at least temporarily double your read instances, or your entire cluster, respectively, which realistically isn't something we plan to solve any time soon.
So as long as we know there will be scenarios in which this type of upgrade is needed anyway (e.g. major clustering changes), there's not much to be gained by avoiding it for scenarios like this one.
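For reference, a minimal sketch of the first idea raised in the comments (each node returning its own schema definitions instead of a bare id); all names here are hypothetical, not the real response types:

```go
package main

import "fmt"

// Retention and SeriesMeta are hypothetical types. The idea: a node returns
// the retention definition itself alongside the data, so the querying node
// never has to index into its own, possibly different, storage-schemas.conf.
type Retention struct {
	Name        string
	IntervalSec int
	TTLSec      int
}

type SeriesMeta struct {
	Target    string
	Retention Retention // self-describing: no cross-node schemaId lookup
}

// alignInterval picks the coarsest native interval across the series, roughly
// what an alignRequests-style step would do with this self-describing metadata.
func alignInterval(series []SeriesMeta) int {
	max := 0
	for _, s := range series {
		if s.Retention.IntervalSec > max {
			max = s.Retention.IntervalSec
		}
	}
	return max
}

func main() {
	series := []SeriesMeta{
		{Target: "a", Retention: Retention{Name: "high-res", IntervalSec: 1, TTLSec: 3600}},
		{Target: "b", Retention: Retention{Name: "default", IntervalSec: 10, TTLSec: 86400}},
	}
	fmt.Println("aligned interval:", alignInterval(series), "s")
}
```

Whether the extra payload is worth it, versus just documenting the blue/green requirement, is exactly the trade-off discussed above.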