Best practice: obtaining complete state after server outage #246

Open
jbanyer opened this issue Dec 1, 2024 · 11 comments

Comments

@jbanyer

jbanyer commented Dec 1, 2024

Since vehicles only send fields which have changed, if our server is down for a while then we will miss updates. There is also the possibility of entirely losing stored vehicle state due to a problem on our server.
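To make the problem concrete, here is a minimal sketch of how a backend might fold change-only records into stored state. The record shape follows the JSON sample posted later in this thread; `apply_record`, the `state` layout, and the `VIN_A` placeholder are assumptions for illustration, not part of fleet-telemetry.

```python
import json


def apply_record(state, record):
    """Merge one telemetry record (changed fields only) into per-VIN state."""
    vehicle = state.setdefault(record["vin"], {})
    for item in record["data"]:
        # Assumption: each value is a small dict such as {"stringValue": "48.3"};
        # take its first entry and remember when the vehicle reported it.
        value = next(iter(item["value"].values()))
        vehicle[item["key"]] = {"value": value, "created_at": record["createdAt"]}
    return state


state = {}
apply_record(state, json.loads(
    '{"data":[{"key":"BatteryLevel","value":{"stringValue":"48.3"}}],'
    '"createdAt":"2024-12-06T20:30:36Z","vin":"VIN_A"}'
))
apply_record(state, json.loads(
    '{"data":[{"key":"Gear","value":{"stringValue":"D"}}],'
    '"createdAt":"2024-12-06T20:31:10Z","vin":"VIN_A"}'
))
```

Because each record carries only the fields that changed, any record lost during an outage leaves the affected keys silently out of date in `state`.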

After a server outage or restart, how should the complete vehicle state be obtained?

I noticed that deleting and recreating fleet telemetry config causes the vehicle to immediately send a telemetry record containing all configured fields. Is that an acceptable way to obtain the complete state?

If so, is it acceptable to do this to every vehicle connected to a service after a restart? That could be many thousands of vehicles.

Most of the time the issue will be a short outage, not a complete loss of server data, so the only issue is missing updates during the outage. It is unnecessary to request complete state for vehicles which sent no updates during the outage. Is it possible to detect for a given vehicle that updates have been missed, and then only trigger a full resend in that case?

Another idea would be to poll the vehicle using the polling API call. That would involve substantially higher costs, though probably not prohibitive (0.2 cents per vehicle after each restart).
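For a sense of scale, the arithmetic can be sketched as follows. The 0.2 cents figure is my estimate from above; the fleet size and the `resync_cost_usd` helper are made-up illustrations, not published pricing.

```python
from decimal import Decimal

# Assumption: 0.2 cents (0.002 USD) per vehicle per poll, as estimated above.
COST_PER_VEHICLE_USD = Decimal("0.002")


def resync_cost_usd(vehicles, restarts=1):
    """Estimated cost of polling every vehicle once after each restart."""
    return COST_PER_VEHICLE_USD * vehicles * restarts


# Hypothetical fleet of 10,000 vehicles and a single restart.
cost = resync_cost_usd(10_000)
```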

@netdata-be

Having an API command to trigger ALL configured fields would be nice indeed.

@Bre77

Bre77 commented Dec 2, 2024

@jbanyer when a vehicle goes offline and comes back, it does backfill its data. Are you saying that when your server goes down and the vehicle reconnects, the same doesn't occur?

I have proof that vehicles backfill when they go offline, but I never take all my load-balanced Fleet Telemetry servers down simultaneously, so I don't know if it works there too.

@jbanyer
Author

jbanyer commented Dec 3, 2024

@Bre77 I'm referring to when the third-party backend system (e.g. my system) is down for a while. During the time that it's down, vehicles will send field changes, and the backend will miss them.

There needs to be some way for a backend to acquire the missed updates when it comes back up.

Having a zero-downtime deploy process helps avoid this situation, but all systems experience total outages occasionally, so there needs to be a method to get the missed updates.

@Adminius

Adminius commented Dec 4, 2024

Similar question: what happens, and what should we do, if the car is offline (e.g. no connection in underground parking)?
We can miss gear, location and speed changes. Will the signals missed while the car was offline be billed?

@Bre77

Bre77 commented Dec 4, 2024

> Similar question: what happens, and what should we do, if the car is offline (e.g. no connection in underground parking)?
> We can miss gear, location and speed changes. Will the signals missed while the car was offline be billed?

The car sends these signals as soon as it reconnects, so I would assume so.

@patrickdemers6
Collaborator

The vehicle stores a buffer of messages (up to 5k messages currently) to be sent once the vehicle comes back online. This behavior is needed as otherwise the backend will not be able to reconcile the vehicle's state.

If the vehicle goes to sleep before reconnecting to the internet, buffered messages will not be sent and you won't be billed.

Server outage is a great question. There is not currently a way to force all data values to be sent. Please don't update fleet-telemetry configurations for all vehicles to trigger this.

I'm not guaranteeing any of these solutions, but these are the ideas that come to mind:

  • Whenever the vehicle reconnects, send everything.
    • Pros: simple
    • Cons: costly when vehicle, network, or server side issues cause reconnects.
  • Applications communicate with the fleet-telemetry server and request that a given VIN send all fields. The fleet-telemetry server sends this request down to the vehicle.

Thoughts or other ideas?

@jbanyer
Author

jbanyer commented Dec 6, 2024

@patrickdemers6 thanks for your reply. I think it would be best if the application was in control, since it is best placed to know that a resync is required.

An API request which prompted the vehicle to resend all telemetry fields should work. Although perhaps there are other solutions.

The situation is probably fairly rare, especially if backends are using a persistent queue mechanism to hold telemetry records. Many developers may choose not to bother making use of a resync mechanism.
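By a persistent queue I mean something along these lines: a minimal sketch that uses Python's built-in `sqlite3` as durable storage. The `DurableQueue` class and its table layout are assumptions for illustration; a real deployment would more likely sit behind a message broker.

```python
import json
import sqlite3


class DurableQueue:
    """Stores telemetry records on disk until a consumer has handled them."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS records ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
            )

    def put(self, record):
        # Commit before acknowledging, so a crash after this point loses nothing.
        with self.db:
            self.db.execute(
                "INSERT INTO records (body) VALUES (?)", (json.dumps(record),)
            )

    def drain(self, handler):
        """Pass queued records to handler in arrival order; return the count."""
        rows = self.db.execute("SELECT id, body FROM records ORDER BY id").fetchall()
        for row_id, body in rows:
            handler(json.loads(body))
            # Delete only after the handler returns without raising.
            with self.db:
                self.db.execute("DELETE FROM records WHERE id = ?", (row_id,))
        return len(rows)
```

This only covers outages downstream of the queue. If the process receiving records from the vehicles is itself down, nothing is queued and the updates are still missed.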

If telemetry records included some kind of sequence number, the backend could detect that a message has been missed and request a resync. But that would require adding a new field just to help with a rare situation? Although it may also be useful to handle race conditions in distributed systems?
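A rough sketch of what that detection could look like on the backend, assuming a per-vehicle counter that increases by one with every record. The `seq` value is hypothetical (telemetry records do not carry one today), and `GapDetector` is just an illustrative name.

```python
class GapDetector:
    """Tracks the highest sequence number seen per VIN and flags gaps."""

    def __init__(self):
        self.last_seq = {}

    def observe(self, vin, seq):
        """Return True when at least one record was skipped for this VIN."""
        previous = self.last_seq.get(vin)
        if previous is None:
            # First record seen for this VIN: nothing to compare against yet.
            self.last_seq[vin] = seq
            return False
        self.last_seq[vin] = max(previous, seq)
        # Late or duplicate records (seq <= previous) are not treated as gaps.
        return seq > previous + 1


detector = GapDetector()
needs_resync = [detector.observe("VIN_A", seq) for seq in (1, 2, 5)]
```

If the stored `last_seq` is itself lost in the outage, the first record after the restart cannot be checked, so a full loss of server data would still need an unconditional resync.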

Perhaps we'll have firmer ideas once we've all had more experience with using fleet telemetry at scale. Cheers.

@bassmaster187

@patrickdemers6 I don't need all fields, just a couple of fields for our state machine like "Gear", "ChargeState" and maybe 2-3 others. So a configurable field list would be much more useful.
All fields from all cars at the same time would maybe overload the server.

@morganofslo

@patrickdemers6 Is this buffering behavior you describe for all fields on the vehicle (so Tesla can build their own state) or only the fields we're subscribed to?

Also, what exactly does 'online' mean here in relation to buffering? Does the car buffer fields when the websocket connection is severed (perhaps due to an issue on our end) or only when it loses its internet connection?

@morganofslo

I think a sequence id is a great idea; it will help us determine if a full sync is needed, but will also help detect partially stale states; for example, say we wanted to calculate voltage from amps and power; we could miss an update to amps but receive updated power data and we'd use a stale amp value to calculate voltage incorrectly.

@iainwhyte

> I think a sequence id is a great idea; it will help us determine if a full sync is needed, but will also help detect partially stale states;

Doesn't the existing timestamp on every update serve that purpose, @morganofslo ?
{"data":[{"key":"BatteryLevel","value":{"stringValue":"48.30769230769231"}}],"createdAt":"2024-12-06T20:30:36.378404662Z","vin":"LRWYH....03"}

If you requested this every 5 minutes, and your value is more than 5 minutes old... it's stale?
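Something like the following is what I have in mind, as a minimal sketch. It assumes `createdAt` is always UTC with a trailing `Z` and nanosecond precision as in the sample above; `parse_created_at` and `is_stale` are illustrative helpers, not part of any library.

```python
from datetime import datetime, timedelta, timezone


def parse_created_at(text):
    """Parse a UTC timestamp, truncating nanoseconds to microseconds."""
    whole, _, fraction = text.rstrip("Z").partition(".")
    parsed = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def is_stale(created_at, now, interval=timedelta(minutes=5)):
    """True when the value is older than the configured reporting interval."""
    return now - parse_created_at(created_at) > interval


sample = "2024-12-06T20:30:36.378404662Z"
now = datetime(2024, 12, 6, 20, 36, 0, tzinfo=timezone.utc)
stale = is_stale(sample, now)
```

One caveat: since vehicles only send fields which have changed, an old timestamp could also mean the value simply has not changed, so this check alone cannot tell an unchanged value from a missed update.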
