Best practice: obtaining complete state after server outage #246

jbanyer · 2024-12-01T03:01:13Z

Since vehicles only send fields which have changed, if our server is down for a while then we will miss updates. There is also the possibility of entirely losing stored vehicle state due to a problem on our server.

After a server outage or restart, how should the complete vehicle state be obtained?

I noticed that deleting and recreating fleet telemetry config causes the vehicle to immediately send a telemetry record containing all configured fields. Is that an acceptable way to obtain the complete state?

If so, is it acceptable to do this to every vehicle connected to a service after a restart? That could be many thousands of vehicles.

Most of the time the issue will be a short outage, not a complete loss of server data, so the only issue is missing updates during the outage. It is unnecessary to request complete state for vehicles which sent no updates during the outage. Is it possible to detect for a given vehicle that updates have been missed, and then only trigger a full resend in that case?

Another idea would be to poll the vehicle using the polling API call. That would involve substantially higher costs, though probably not prohibitive (0.2 cents per vehicle after each restart).

netdata-be · 2024-12-02T06:57:55Z

Having an API command to trigger ALL configured fields would be nice indeed.

Bre77 · 2024-12-02T23:42:42Z

@jbanyer when a vehicle goes offline and comes back, it does backfill its data. Are you saying that when your server goes down and the vehicle reconnects, the same doesn't occur?

I have proof that vehicles backfill when they go offline, but I never take all my load-balanced Fleet Telemetry servers down simultaneously, so I don't know if it works there too.

jbanyer · 2024-12-03T00:41:42Z

@Bre77 I'm referring to when the third-party backend system (e.g. my system) is down for a while. During the time that it's down, vehicles will send field changes, and the backend will miss them.

There needs to be some way for a backend to acquire the missed updates when it comes back up.

Having a zero-downtime deploy process helps avoid this situation, but all systems experience total outages occasionally, so there needs to be a method to get the missed updates.

Adminius · 2024-12-04T10:10:43Z

Similar question: what happens/what to do if the car is offline (like no connection in underground parking)?
We can miss gear changes, location and speed changes. Will the signals missed while the car was offline be billed?

Bre77 · 2024-12-04T11:49:14Z

> Similar question: what happens/what to do if the car is offline (like no connection in underground parking)?
> We can miss gear changes, location and speed changes. Will the signals missed while the car was offline be billed?

The car sends these signals as soon as it reconnects, so I would assume so.

patrickdemers6 · 2024-12-05T03:44:26Z

The vehicle stores a buffer of messages (up to 5k messages currently) to be sent once the vehicle comes back online. This behavior is needed as otherwise the backend will not be able to reconcile the vehicle's state.

If the vehicle goes to sleep before reconnecting to the internet, buffered messages will not be sent and you won't be billed.

Server outage is a great question. There is not currently a way to force all data values to be sent. Please don't update fleet-telemetry configurations for all vehicles to trigger this.

I'm not guaranteeing any of these solutions, but these are the ideas that come to mind:

1. Whenever the vehicle reconnects, send everything.
   - Pros: simple
   - Cons: costly when vehicle, network, or server-side issues cause reconnects.
2. Applications communicate with the fleet-telemetry server and request that a given VIN send all fields. The fleet-telemetry server sends this request down to the vehicle.
   - Pro: puts the application in control
   - Con: adds one more thing to think about when integrating with fleet telemetry. This could potentially have an interface like the one I proposed in the stale PR "introduce data connectors, used to validate if vin is allowed to connect" (#177), though an implementation like that would only handle on connect, not periodic true-ups.

Thoughts or other ideas?

jbanyer · 2024-12-06T04:16:00Z

@patrickdemers6 thanks for your reply. I think it would be best if the application was in control, since it is best placed to know that a resync is required.

An API request which prompted the vehicle to resend all telemetry fields should work. Although perhaps there are other solutions.

The situation is probably fairly rare, especially if backends are using a persistent queue mechanism to hold telemetry records. Many developers may choose not to bother making use of a resync mechanism.

If telemetry records included some kind of sequence number, the backend could detect that a message has been missed and request a resync. But that would require adding a new field just to help with a rare situation? Although it may also be useful to handle race conditions in distributed systems?
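
To make the sequence number idea concrete, here is a minimal sketch of the gap detection a backend could do. It assumes each record carries a per-vehicle, monotonically increasing integer, called `seq` here. That field is hypothetical: fleet telemetry records do not include one today, and `SequenceTracker` is just an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceTracker:
    """Tracks the last sequence number seen per VIN.

    ASSUMPTION: records carry a per-vehicle, monotonically increasing
    integer `seq`. No such field exists in fleet telemetry records today.
    """

    last_seq: dict[str, int] = field(default_factory=dict)

    def observe(self, vin: str, seq: int) -> bool:
        """Record `seq` for `vin`; return True if a gap was detected."""
        previous = self.last_seq.get(vin)
        if previous is not None and seq <= previous:
            # Duplicate or late delivery: this record reveals no new gap.
            return False
        self.last_seq[vin] = seq
        # The first record for a VIN has no baseline, so it cannot show a gap.
        return previous is not None and seq != previous + 1


tracker = SequenceTracker()
tracker.observe("VIN1", 10)                 # first record, no baseline
needs_resync = tracker.observe("VIN1", 14)  # 11..13 never arrived
```

After a complete loss of stored state there is no baseline at all, so a backend would probably treat "no previous sequence number" as needing a resync too. That choice is left out of the sketch.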

Perhaps we'll have firmer ideas once we've all had more experience with using fleet telemetry at scale. Cheers.

bassmaster187 · 2024-12-06T07:28:16Z

@patrickdemers6 I don't need all fields, just a couple of fields for our state machine like "Gear", "ChargeState" and maybe 2-3 others. So a configurable field list would be much more useful.
Sending all fields from all cars at the same time might overload the server.

morganofslo · 2024-12-06T20:31:53Z

@patrickdemers6 Is this buffering behavior you describe for all fields on the vehicle (so Tesla can build their own state) or only the fields we're subscribed to?

Also, what exactly does 'online' mean here in relation to buffering? Does the car buffer fields when the websocket connection is severed (perhaps due to an issue on our end) or only when it loses its internet connection?

morganofslo · 2024-12-06T20:54:32Z

I think a sequence id is a great idea; it will help us determine if a full sync is needed, but will also help detect partially stale states; for example, say we wanted to calculate voltage from amps and power; we could miss an update to amps but receive updated power data and we'd use a stale amp value to calculate voltage incorrectly.
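
A rough sketch of the partially stale case, under the assumption that every stored value remembers the (hypothetical) sequence id of the record that carried it, and that the backend knows the first sequence id it received after the most recent gap. None of these names exist in fleet telemetry; they are only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldValue:
    value: float
    seq: int  # hypothetical sequence id of the record that carried this value


def derive_voltage(
    power_w: FieldValue, amps: FieldValue, first_seq_after_gap: int
) -> Optional[float]:
    """Return power / amps, or None if either input may be stale."""
    for item in (power_w, amps):
        if item.seq < first_seq_after_gap:
            # This value predates the gap, so an update to it may have been lost.
            return None
    if amps.value == 0:
        return None  # voltage cannot be derived from zero current
    return power_w.value / amps.value


# Power was refreshed after the gap, but amps was last seen before it.
voltage = derive_voltage(FieldValue(7200.0, 14), FieldValue(30.0, 11), 14)
```

Here `voltage` is `None` because the amps value predates the gap. Once a newer amps update arrives, the same call returns a number.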

iainwhyte · 2024-12-06T21:07:18Z

> I think a sequence id is a great idea; it will help us determine if a full sync is needed, but will also help detect partially stale states;

Doesn't the existing timestamp on every update serve that purpose, @morganofslo?
{"data":[{"key":"BatteryLevel","value":{"stringValue":"48.30769230769231"}}],"createdAt":"2024-12-06T20:30:36.378404662Z","vin":"LRWYH....03"}

If you requested this every 5 minutes, and your value is more than 5 minutes old... it's stale?
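
As a sketch of that timestamp check, the snippet below parses `createdAt` from the record above and flags fields whose latest update is older than a chosen maximum age. The helper names and the `max_age` parameter are made up for illustration. Nanoseconds are truncated to microseconds because Python's `datetime` does not store finer precision.

```python
import re
from datetime import datetime, timedelta, timezone

_TS = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_created_at(value: str) -> datetime:
    """Parse a UTC timestamp like the one above, truncating to microseconds."""
    match = _TS.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported timestamp: {value!r}")
    base, fraction = match.groups()
    micros = int((fraction or "")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def stale_keys(
    last_seen: dict[str, datetime], now: datetime, max_age: timedelta
) -> list[str]:
    """Return the field keys whose latest update is older than `max_age`."""
    return sorted(key for key, seen in last_seen.items() if now - seen > max_age)


record = {
    "data": [{"key": "BatteryLevel", "value": {"stringValue": "48.30769230769231"}}],
    "createdAt": "2024-12-06T20:30:36.378404662Z",
    "vin": "LRWYH....03",
}
created = parse_created_at(record["createdAt"])
last_seen = {item["key"]: created for item in record["data"]}
stale = stale_keys(last_seen, created + timedelta(minutes=6), timedelta(minutes=5))
```

One caveat: since vehicles only send fields which have changed, an old timestamp can also just mean the value has not changed. So this flags candidates for a resync rather than proving an update was missed.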
