(2.12) [ADDED] Offline assets support#7158
Conversation
|
While this is a step ahead of- at least streams don’t run with wrong config I think there is much to be desired. Are we now doing stream config loads with strict parsing? This should also fail loading right The ask is that the stream starts but in a new degraded way where it’s in stasis. Importantly it prevents other streams from being created with subject overlap and that users - who don’t have logs access - can see why this is not starting. So rather than a server saying it can’t start it it starts a stream in a special degraded manner - read only in all ways and safe guarding it’s subject space (but not listening) |
We could do this as an additional step, but not strictly required now that we have the API level. Could optionally add though.
This protection is there. Users will see that the stream can't be created due to overlap.
This is exactly how it's implemented. The server works fully, only that stream/consumer will be degraded. The assignment is there and queryable with stream/consumer info, guards the subject space still, but does not actually run. |
|
Will update this PR early next week to allow stream/consumer list/names requests. That's probably a bit too conservative/restrictive currently that an unsupported stream/consumer would be omitted from the result in some cases. |
|
Right I misread the PR description sorry, this is pretty cool. I was hoping the streams would be a bit more normally functional, meaning they would respond to stream INFO like a stream, not just an error. Perhaps it would load the JSON in what it can and populate fields but then have the reason for being offline additionally in the state but no messages etc shown in state. Info would be shown, stream list would include them etc, this way tooling largely works and the situation becomes immediately obvious. Where with just the error below, its really hard to know what's up unless you can access the logs |
fbddb17 to
5b43080
Compare
Have adjusted this PR to return Opened a CLI PR as well, as missing/inaccessible streams were not shown for stream ls/report: nats-io/natscli#1452 |
server/jetstream_cluster.go
Outdated
|
|
||
| func (wca *writeableConsumerAssignment) UnmarshalJSON(data []byte) error { | ||
| var ca consumerAssignment | ||
| if err := json.Unmarshal(data, &ca); err != nil { |
There was a problem hiding this comment.
We could go a step further here and decode the assignments in strict mode to catch unknown configuration fields too, WDYT?
dec := json.NewDecoder(bytes.NewReader(data))
dec.DisallowUnknownFields()
if err := dec.Decode(); err != nil { ... }
There was a problem hiding this comment.
Done.
We decode under strict and mark it as unsupported if it fails. We then still decode with json.Unmarshal to include as much info that's recognized to protect stream subjects space, respond to API requests, etc.
5b43080 to
031577d
Compare
|
Looking into how to respond to stream/consumer list requests with the |
9015e23 to
e525647
Compare
|
Have done both. So we now respond to stream/consumer list requests with the Additionally, as discussed with @ripienaar: |
0616f73 to
38ffbcf
Compare
|
We can have a call next week about it but I think there's still some problems with this we might want to discuss:
Mostly we need to think about the fact that it's not just humans looking at the results of these APIs, we need to not be doing something that could deliberately mislead client or app behaviours either. |
6053afc to
a47ff00
Compare
Offline assets will appear in the existing "missing" list for older tools and newer tools can access reasons in the new offline fields. Stream info will error. Supports nats-io/nats-server#7158 Signed-off-by: R.I.Pienaar <rip@devco.net>
a47ff00 to
9c1ea78
Compare
|
As discussed, updated this PR to respond with an error on info requests, and list requests contain a |
9c1ea78 to
8523390
Compare
|
|
||
| // If standalone/single-server, the offline reason needs to be stored directly in the consumer. | ||
| // Otherwise, if clustered it will be part of the consumer assignment. | ||
| offlineReason string |
There was a problem hiding this comment.
Any way to merge these so code is the same between single server and clustered?
There was a problem hiding this comment.
Think this is not easily or at all possible due to the differences between single server and clustered.
We use streamAssignment and consumerAssignment to host this data and set up an info sub when clustered. We don't use those structs at all as single server (and don't use that info sub), so need to keep them somewhere else.
| Consumers []*ConsumerInfo `json:"consumers"` | ||
| Missing []string `json:"missing,omitempty"` | ||
| Consumers []*ConsumerInfo `json:"consumers"` | ||
| Missing []string `json:"missing,omitempty"` |
There was a problem hiding this comment.
Are Missing and Offline kinda the same?
There was a problem hiding this comment.
They are different. Missing is represented as "inaccessible" in the CLI, and means a R1 stream where the server hosting it is offline, or R3 where all 3 servers are offline, or if there's currently no leader.
Offline means a 2.11 server came up and recognized it can't support 2.12 features for a stream/consumer. That reports the stream/consumer name as the key, and the value contains the error message containing the API level that that asset requires.
We do populate the Missing with all entries in Offline as well, since on the one hand they are technically missing in that case, but mostly because older tooling will not recognize Offline until upgraded.
There was a problem hiding this comment.
Afaik missing can also be assets who just didnt respond quick enough during the scatter-gather info gathering right? Doesn't mean they aren't actually up and functioning.
There was a problem hiding this comment.
Yeah, either didn't respond quickly enough due to RTT (perhaps), or more likely for all servers to be up but there's no quorum to elect a leader.
There was a problem hiding this comment.
Offline means Unsupported? Should we call it unsupported vs offline?
There was a problem hiding this comment.
Was thinking about this as well. Think the name primarily came from "running the stream in an Offline Mode".
What do you think @ripienaar?
There was a problem hiding this comment.
Unsupported is one possible reason for being offline. I imagine we can handle other reasons for being offline this way also - like subject clashes, corruption egc
There was a problem hiding this comment.
Good point actually. If a consumer is unsupported it also stops the stream. That stream is offline because it's stopped, not because it's unsupported.
So Offline makes more sense
6e98ae2 to
9bd79c4
Compare
| s.Warnf(" Detected unsupported consumer '%s > %s > %s', delete the consumer or upgrade the server to API level %s", a.Name, e.mset.name(), cfg.Name, apiLevel) | ||
| singleServerMode := !s.JetStreamIsClustered() && s.standAloneMode() | ||
| if singleServerMode { | ||
| if !e.mset.closed.Load() { |
There was a problem hiding this comment.
Can this condition really happen?
There was a problem hiding this comment.
Yes, we specifically want to stop the stream if it wasn't already.
The case is where the stream's config is fully supported. But we then encounter a consumer here that is NOT supported. That means we need to stop the stream and put it into the offline "stopped" mode.
So, this condition makes sure we stop the stream if it wasn't already, and put it into unsupported/offline/stopped mode. Stopping the stream as part of e.mset.stop below makes sure the stream will be mset.closed.Load() as well.
server/jetstream.go
Outdated
| e.mset.mu.Unlock() | ||
| } | ||
| continue | ||
| } else if strictErr != nil { |
There was a problem hiding this comment.
A bit confused about how this differs to the above, are we only marking the stream as "offline" due to metadata and not due to unknown config fields in single server mode here?
There was a problem hiding this comment.
are we only marking the stream as "offline" due to metadata and not due to unknown config fields in single server mode here?
Correct
We could also put the stream/consumer into offline mode if strict decode failed BUT the feature API level should be supported?
In that case we should use a different offlineReason.
There was a problem hiding this comment.
It's probably better if we populate the offline reason regardless, either with the API level (first) or the decoding error (second), just to be absolutely sure we can't footgun this again in the future.
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
9bd79c4 to
eebc56c
Compare
Includes the following: - #7200 - #7201 - #7202 - #7209 - #7210 - #7211 - #7213 - #7212 - #7216 - #7217 - #7230 - #7239 - #7246 - #7248 - 8241a15, specifically delayed errors that are not JS API errors - #7158 (not containing 2.12-specific changes) - #7233 - #7255 - #7249 - #7259 - #7265 - #7273 (not including Go 1.25.x) - #7258 - #7222 Signed-off-by: Maurice van Veen <github@mauricevanveen.com> Signed-off-by: Neil Twigg <neil@nats.io>
Offline assets will appear in the existing "missing" list for older tools and newer tools can access reasons in the new offline fields. Stream info will error. Supports nats-io/nats-server#7158 Signed-off-by: R.I.Pienaar <rip@devco.net>
Implements Offline Assets support from ADR-44. Streams and consumers will be put in "offline mode" if the server doesn't support the required API level.
Example:
Stream/consumer list and names requests still return the names of these streams. Stream/consumer info still works in this state, and a stream/consumer can still be deleted as well. The "offline reason" is included as an error in the stream/consumer info requests and is contained in the
offline: map[string]stringfield in the list requests.These issues will be logged during startup, if an update request is performed. The raw JSON is always preserved, so downgrading to a server that doesn't understand the new JSON fields can safely upgrade and the JSON will be preserved as expected. Importantly, if a stream or consumer is recognized as unsupported, they are not even loaded by the server. This means data will be safe and not even be loaded on an older version server.
This will allow for more graceful downgrades from 2.12 to 2.11 (and onward), where loads of new features would stop functioning after a downgrade: counter-streams, atomic batch, etc.
Signed-off-by: Maurice van Veen github@mauricevanveen.com