(2.12) [ADDED] Offline assets support #7158

Merged
neilalexander merged 2 commits into main from maurice/offline-assets
Aug 26, 2025

Conversation

@MauriceVanVeen
Member

@MauriceVanVeen MauriceVanVeen commented Aug 8, 2025

Implements Offline Assets support from ADR-44. Streams and consumers will be put in "offline mode" if the server doesn't support the required API level.

Example:

[INF] ---------------- JETSTREAM ----------------
...
[INF]   API Level:       2
[INF] -------------------------------------------
[WRN]   Detected unsupported stream 'js > test-stream', delete the stream or upgrade the server to API level 10
[INF] Starting JetStream cluster
[INF] Creating JetStream metadata controller
[INF] JetStream cluster recovering state
[INF] Listening for client connections on 0.0.0.0:4444
[WRN] Detected unsupported stream 'js > test-stream', delete the stream or upgrade the server to API level 10
[INF] Server is ready
[INF] Cluster name is nats-cluster

Stream/consumer list and names requests still return the names of these streams. Stream/consumer info still works in this state, and a stream/consumer can still be deleted as well. The "offline reason" is included as an error in the stream/consumer info requests and is contained in the offline: map[string]string field in the list requests.

These issues will be logged during startup, and if an update request is performed. The raw JSON is always preserved, so after downgrading to a server that doesn't understand the new JSON fields you can safely upgrade again and the JSON will be preserved as expected. Importantly, if a stream or consumer is recognized as unsupported, it is not even loaded by the server. This means the data stays safe and is never loaded by an older server version.

This will allow for more graceful downgrades from 2.12 to 2.11 (and onward), where loads of new features would stop functioning after a downgrade: counter-streams, atomic batch, etc.
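
The API-level gate described above can be sketched roughly as follows. This is only an illustration, not the server's code: the metadata key `required_api_level`, the function name `offlineReason`, and the wording of the reason string are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// requiredLevelKey is a hypothetical metadata key, used for illustration only.
const requiredLevelKey = "required_api_level"

// offlineReason returns an empty string when the asset is supported by a
// server at serverLevel, or a human-readable reason when the asset would
// have to be put into offline mode.
func offlineReason(serverLevel int, metadata map[string]string) string {
	raw, ok := metadata[requiredLevelKey]
	if !ok {
		// No requirement recorded: treat the asset as supported.
		return ""
	}
	required, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Sprintf("unsupported - invalid required API level %q", raw)
	}
	if required > serverLevel {
		return fmt.Sprintf("unsupported - required API level: %d, current API level: %d", required, serverLevel)
	}
	return ""
}

func main() {
	meta := map[string]string{requiredLevelKey: "10"}
	// A server at API level 2 cannot run an asset that requires level 10.
	fmt.Println(offlineReason(2, meta))
	// A server at API level 10 can, so no reason is produced.
	fmt.Println(offlineReason(10, meta) == "")
}
```

Returning a reason string rather than a boolean keeps the explanation attached to the asset, so it can be surfaced to users who have no access to the logs.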

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner August 8, 2025 19:56
@ripienaar
Contributor

ripienaar commented Aug 8, 2025

While this is a step ahead (at least streams don't run with the wrong config), I think there is much to be desired. Are we now doing stream config loads with strict parsing? This should also fail loading, right?

The ask is that the stream starts but in a new degraded way where it’s in stasis. Importantly it prevents other streams from being created with subject overlap and that users - who don’t have logs access - can see why this is not starting.

So rather than a server saying it can't start it, it starts a stream in a special degraded manner: read-only in all ways and safeguarding its subject space (but not listening).

@MauriceVanVeen
Member Author

Are we now doing stream config loads with strict parsing? This should also fail loading, right?

We could do this as an additional step, but not strictly required now that we have the API level. Could optionally add though.

The ask is that the stream starts but in a new degraded way where it’s in stasis. Importantly it prevents other streams from being created with subject overlap and that users - who don’t have logs access - can see why this is not starting.

This protection is there. Users will see that the stream can't be created due to overlap.

So rather than a server saying it can't start it, it starts a stream in a special degraded manner: read-only in all ways and safeguarding its subject space (but not listening).

This is exactly how it's implemented. The server works fully, only that stream/consumer will be degraded. The assignment is there and queryable with stream/consumer info, guards the subject space still, but does not actually run.

@MauriceVanVeen
Member Author

Will update this PR early next week to allow stream/consumer list/names requests. It's probably a bit too conservative/restrictive currently, in that an unsupported stream/consumer would be omitted from the result in some cases.

@ripienaar
Contributor

ripienaar commented Aug 9, 2025

Right I misread the PR description sorry, this is pretty cool.

I was hoping the streams would be a bit more normally functional, meaning they would respond to stream INFO like a stream, not just an error. Perhaps it would load what it can from the JSON and populate fields, but then additionally have the reason for being offline in the state, with no messages etc. shown in state. Info would be shown, stream list would include them etc. This way tooling largely works and the situation becomes immediately obvious.

Whereas with just the error below, it's really hard to know what's up unless you can access the logs:

14:49:15 <<< $JS.API.STREAM.INFO.COUNTER
{"type":"io.nats.jetstream.api.v1.stream_info_response","error":{"code":500,"err_code":10118,"description":"stream is offline"},"total":0,"offset":0,"limit":0}
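
For tooling that only sees this response, one way to tell the offline case apart from other failures is to check the error code. The sketch below assumes the payload shape quoted above and that `err_code` 10118 means "stream is offline", as in that response. The type and function names are hypothetical and not part of any client library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the "error" object in the response quoted above.
type apiError struct {
	Code        int    `json:"code"`
	ErrCode     int    `json:"err_code"`
	Description string `json:"description"`
}

// infoResponse is a hypothetical, cut-down view of a stream info response.
type infoResponse struct {
	Type  string    `json:"type"`
	Error *apiError `json:"error,omitempty"`
}

// errCodeStreamOffline is the err_code seen in the quoted response.
const errCodeStreamOffline = 10118

// isStreamOffline reports whether the payload carries the offline error.
func isStreamOffline(payload []byte) (bool, error) {
	var resp infoResponse
	if err := json.Unmarshal(payload, &resp); err != nil {
		return false, err
	}
	return resp.Error != nil && resp.Error.ErrCode == errCodeStreamOffline, nil
}

func main() {
	payload := []byte(`{"type":"io.nats.jetstream.api.v1.stream_info_response","error":{"code":500,"err_code":10118,"description":"stream is offline"},"total":0,"offset":0,"limit":0}`)
	offline, err := isStreamOffline(payload)
	fmt.Println(offline, err)
}
```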

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch from fbddb17 to 5b43080 Compare August 11, 2025 08:21
@MauriceVanVeen
Member Author

I was hoping the streams would be a bit more normally functional, meaning they would respond to stream INFO like a stream, not just an error. Perhaps it would load what it can from the JSON and populate fields, but then additionally have the reason for being offline in the state, with no messages etc. shown in state. Info would be shown, stream list would include them etc. This way tooling largely works and the situation becomes immediately obvious.

Have adjusted this PR to return Offline: true, OfflineReason: <reason> instead of the offline error for stream/consumer info. Stream/consumer list and names requests now also include them.

Opened a CLI PR as well, as missing/inaccessible streams were not shown for stream ls/report: nats-io/natscli#1452


func (wca *writeableConsumerAssignment) UnmarshalJSON(data []byte) error {
var ca consumerAssignment
if err := json.Unmarshal(data, &ca); err != nil {
Member

We could go a step further here and decode the assignments in strict mode to catch unknown configuration fields too, WDYT?

dec := json.NewDecoder(bytes.NewReader(data))
dec.DisallowUnknownFields() // reject fields this server version doesn't know
if err := dec.Decode(&ca); err != nil { ... }

Member Author

Done.
We decode in strict mode and mark it as unsupported if that fails. We then still decode with json.Unmarshal to include as much recognized info as possible, to protect the stream's subject space, respond to API requests, etc.
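
The two-pass decode described here might look roughly like the sketch below. The `assignment` struct and `decodeAssignment` function are hypothetical stand-ins, not the server's `streamAssignment`/`consumerAssignment` code. Only the `encoding/json` calls are real standard-library API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// assignment is a hypothetical, cut-down stand-in for the real assignment structs.
type assignment struct {
	Name     string   `json:"name"`
	Subjects []string `json:"subjects,omitempty"`
}

// decodeAssignment decodes data twice. The strict pass only detects unknown
// fields; its error is returned separately as strictErr. The lenient pass
// fills in every field this version recognizes, so the subject space can
// still be protected and API requests answered.
func decodeAssignment(data []byte) (a assignment, strictErr, err error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	var strict assignment
	strictErr = dec.Decode(&strict)

	err = json.Unmarshal(data, &a)
	return a, strictErr, err
}

func main() {
	// "new_feature" plays the role of a field added by a newer server version.
	data := []byte(`{"name":"test-stream","subjects":["orders.>"],"new_feature":true}`)
	a, strictErr, err := decodeAssignment(data)
	fmt.Println(a.Name, a.Subjects)
	fmt.Println("strict decode failed:", strictErr != nil)
	fmt.Println("lenient decode error:", err)
}
```

Keeping the strict error separate from the lenient result lets the caller mark the asset as unsupported while still using the fields it does understand.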

@MauriceVanVeen
Member Author

Looking into how to respond to stream/consumer list requests with the OfflineReason instead of it being included as missing.
And, need to figure out how to handle standalone servers.

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch 3 times, most recently from 9015e23 to e525647 Compare August 14, 2025 22:46
@MauriceVanVeen
Member Author

Have done both. So we now respond to stream/consumer list requests with the OfflineReason. Standalone/single-server now also knows not to load the stream or consumer if it's unsupported.

Additionally, as discussed with @ripienaar:
If an unsupported stream or consumer is detected, the stream is stopped to ensure all state will freeze in place. This is particularly important if you have an interest-based stream with a single consumer that has interest. Keeping the consumer interest alive but having it not be usable would keep growing the stream size. Not registering the consumer interest would mean messages would be ingested but immediately dropped. Stopping the entire stream in this case is the safest. Any apps, either publishing or consuming, will know they are using now-unsupported assets. And the server will either need to be upgraded, or the relevant stream/consumer will need to be deleted.

@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review August 14, 2025 22:56
Contributor

@ripienaar ripienaar left a comment

LGTM, will ask @ploubser next week to resurrect his branches and test for us also

@MauriceVanVeen MauriceVanVeen changed the title [ADDED] Offline assets support (2.12) [ADDED] Offline assets support Aug 15, 2025
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch from 0616f73 to 38ffbcf Compare August 15, 2025 12:12
@neilalexander
Member

neilalexander commented Aug 15, 2025

We can have a call next week about it, but I think there are still some problems with this that we might want to discuss:

  • Returning anything that looks like a properly-formed stream info or consumer info is misleading to applications/clients if they aren't specifically aware of the Offline and OfflineReason fields. A stream info in this case could be interpreted by an app as though you had just created a stream and it's still empty.
  • I also don't think we really want to have to train clients/apps to expect those fields either, given how unusual the situation would be. It would be better to define a JS API error code in this case for the info requests. We can include the offline reason in that, but anything not expecting an error should puke in the right way in that case.
  • Stream list/report and consumer list/report likewise shouldn't try to return unsupported assets with faked states in the streams or consumers keys either, for the same reason as above. We currently have missing; we could reuse that or define another unsupported key there that includes reasons.
  • May want to think about which non-list/report/info endpoints should return anything at all vs just not responding and timing out

Mostly we need to think about the fact that it's not just humans looking at the results of these APIs, we need to not be doing something that could deliberately mislead client or app behaviours either.

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch 2 times, most recently from 6053afc to a47ff00 Compare August 19, 2025 09:51
ripienaar added a commit to ripienaar/jsm.go that referenced this pull request Aug 19, 2025
Offline assets will appear in the existing "missing" list
for older tools and newer tools can access reasons in the new
offline fields.

Stream info will error.

Supports nats-io/nats-server#7158

Signed-off-by: R.I.Pienaar <rip@devco.net>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch from a47ff00 to 9c1ea78 Compare August 19, 2025 17:29
@MauriceVanVeen
Member Author

As discussed, updated this PR to respond with an error on info requests, and list requests contain an offline: map[string]string mapping asset name to the offline reason. missing will also always contain these streams to support older tools that don't recognize the new offline field.
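
From a tool's point of view, the list response described here could be consumed along these lines. This is a sketch under assumptions: only the `missing` and `offline` keys come from the description above, while the struct name, the helper names, and the sample payload are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// listResponse is a hypothetical, cut-down view of a stream names/list response.
type listResponse struct {
	Streams []string          `json:"streams"`
	Missing []string          `json:"missing,omitempty"`
	Offline map[string]string `json:"offline,omitempty"`
}

// parseList decodes a list response payload.
func parseList(payload []byte) (listResponse, error) {
	var resp listResponse
	err := json.Unmarshal(payload, &resp)
	return resp, err
}

// describeMissing explains every entry in "missing". Newer tools can use the
// "offline" map to show the reason; entries without one are just inaccessible.
func describeMissing(resp listResponse) map[string]string {
	out := make(map[string]string, len(resp.Missing))
	for _, name := range resp.Missing {
		if reason, ok := resp.Offline[name]; ok {
			out[name] = "offline: " + reason
		} else {
			out[name] = "inaccessible"
		}
	}
	return out
}

func main() {
	// Illustrative payload, not captured from a real server.
	payload := []byte(`{"streams":["orders"],"missing":["counters","events"],"offline":{"counters":"requires API level 10"}}`)
	resp, err := parseList(payload)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for name, why := range describeMissing(resp) {
		fmt.Printf("%s: %s\n", name, why)
	}
}
```

An older tool that ignores `offline` would still report both names as missing, which is the backwards-compatible behaviour described above.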

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch from 9c1ea78 to 8523390 Compare August 20, 2025 08:58

// If standalone/single-server, the offline reason needs to be stored directly in the consumer.
// Otherwise, if clustered it will be part of the consumer assignment.
offlineReason string
Member

Any way to merge these so code is the same between single server and clustered?

Member Author

Think this is not easily or at all possible due to the differences between single server and clustered.

We use streamAssignment and consumerAssignment to host this data and set up an info sub when clustered. We don't use those structs at all as single server (and don't use that info sub), so need to keep them somewhere else.

Consumers []*ConsumerInfo `json:"consumers"`
Missing []string `json:"missing,omitempty"`
Member

Are Missing and Offline kinda the same?

Member Author

They are different. Missing is represented as "inaccessible" in the CLI, and means an R1 stream where the server hosting it is offline, or R3 where all 3 servers are offline, or if there's currently no leader.

Offline means a 2.11 server came up and recognized it can't support 2.12 features for a stream/consumer. That reports the stream/consumer name as the key, and the value contains the error message containing the API level that that asset requires.

We do populate the Missing with all entries in Offline as well, since on the one hand they are technically missing in that case, but mostly because older tooling will not recognize Offline until upgraded.
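
Populating Missing from Offline can be sketched as a small merge step. This is an illustrative assumption about the shape of the logic, not the server's implementation, and `mergeOfflineIntoMissing` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeOfflineIntoMissing returns the missing names plus every offline asset
// name, without duplicates, sorted for stable output. Older tooling that only
// reads "missing" then still sees the offline assets.
func mergeOfflineIntoMissing(missing []string, offline map[string]string) []string {
	seen := make(map[string]struct{}, len(missing)+len(offline))
	out := make([]string, 0, len(missing)+len(offline))
	for _, name := range missing {
		if _, dup := seen[name]; !dup {
			seen[name] = struct{}{}
			out = append(out, name)
		}
	}
	for name := range offline {
		if _, dup := seen[name]; !dup {
			seen[name] = struct{}{}
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	missing := []string{"events"}
	offline := map[string]string{"counters": "requires API level 10"}
	fmt.Println(mergeOfflineIntoMissing(missing, offline))
}
```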

Contributor

Afaik missing can also be assets that just didn't respond quickly enough during the scatter-gather info gathering, right? Doesn't mean they aren't actually up and functioning.

Member Author

Yeah, either they didn't respond quickly enough due to RTT (perhaps), or, more likely, all servers are up but there's no quorum to elect a leader.

Member

Offline means Unsupported? Should we call it unsupported vs offline?

Member Author

Was thinking about this as well. Think the name primarily came from "running the stream in an Offline Mode".
What do you think @ripienaar?

Contributor

Unsupported is one possible reason for being offline. I imagine we can handle other reasons for being offline this way also, like subject clashes, corruption etc.

Member Author

Good point actually. If a consumer is unsupported it also stops the stream. That stream is offline because it's stopped, not because it's unsupported.
So Offline makes more sense

@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch 2 times, most recently from 6e98ae2 to 9bd79c4 Compare August 26, 2025 08:53
s.Warnf(" Detected unsupported consumer '%s > %s > %s', delete the consumer or upgrade the server to API level %s", a.Name, e.mset.name(), cfg.Name, apiLevel)
singleServerMode := !s.JetStreamIsClustered() && s.standAloneMode()
if singleServerMode {
if !e.mset.closed.Load() {
Member

Can this condition really happen?

Member Author

Yes, we specifically want to stop the stream if it wasn't already.

The case is where the stream's config is fully supported. But we then encounter a consumer here that is NOT supported. That means we need to stop the stream and put it into the offline "stopped" mode.

So, this condition makes sure we stop the stream if it wasn't already, and put it into unsupported/offline/stopped mode. Stopping the stream as part of e.mset.stop below makes sure mset.closed.Load() will report true afterwards as well.

e.mset.mu.Unlock()
}
continue
} else if strictErr != nil {
Member

A bit confused about how this differs to the above, are we only marking the stream as "offline" due to metadata and not due to unknown config fields in single server mode here?

Member Author

are we only marking the stream as "offline" due to metadata and not due to unknown config fields in single server mode here?

Correct

We could also put the stream/consumer into offline mode if strict decode failed BUT the feature API level should be supported?
In that case we should use a different offlineReason.

Member

It's probably better if we populate the offline reason regardless, either with the API level (first) or the decoding error (second), just to be absolutely sure we can't footgun this again in the future.
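
The precedence suggested here (API level first, decoding error second) can be sketched as below. The function name `pickOfflineReason`, the sample error, and the wording of the reason strings are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownField stands in for a strict-decoding failure; the text is illustrative.
var errUnknownField = errors.New(`json: unknown field "new_feature"`)

// pickOfflineReason prefers the API-level reason when there is one and falls
// back to the strict-decoding error. An empty result means the asset can run.
func pickOfflineReason(apiLevelReason string, strictErr error) string {
	if apiLevelReason != "" {
		return apiLevelReason
	}
	if strictErr != nil {
		return fmt.Sprintf("unsupported - config error: %v", strictErr)
	}
	return ""
}

func main() {
	// Both checks failed: the API-level reason wins.
	fmt.Println(pickOfflineReason("unsupported - required API level: 10", errUnknownField))
	// Only strict decoding failed: the decoding error becomes the reason.
	fmt.Println(pickOfflineReason("", errUnknownField))
}
```

Because a reason is produced whenever either check fails, an asset with unknown config fields cannot slip through just because its recorded API level looks supported.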

Member Author

Done

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/offline-assets branch from 9bd79c4 to eebc56c Compare August 26, 2025 13:08
Member

@neilalexander neilalexander left a comment

LGTM

@neilalexander neilalexander merged commit 4598607 into main Aug 26, 2025
89 of 92 checks passed
@neilalexander neilalexander deleted the maurice/offline-assets branch August 26, 2025 14:39
neilalexander added a commit that referenced this pull request Sep 8, 2025
Includes the following:
- #7200
- #7201
- #7202
- #7209
- #7210
- #7211
- #7213
- #7212
- #7216
- #7217
- #7230
- #7239
- #7246
- #7248
- 8241a15, specifically delayed errors that are not JS API errors
- #7158 (not containing 2.12-specific changes)
- #7233
- #7255
- #7249
- #7259
- #7265
- #7273 (not including Go 1.25.x)
- #7258
- #7222

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Neil Twigg <neil@nats.io>
alexbozhenko pushed a commit to alexbozhenko/jsm.go that referenced this pull request Sep 19, 2025
Offline assets will appear in the existing "missing" list
for older tools and newer tools can access reasons in the new
offline fields.

Stream info will error.

Supports nats-io/nats-server#7158

Signed-off-by: R.I.Pienaar <rip@devco.net>
@MauriceVanVeen MauriceVanVeen linked an issue Mar 27, 2026 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Disabled JetStream Assets