Create canonical auction ids in api / How to handle pod auto scaling? #175

Closed
vkgnosis opened this issue Apr 28, 2022 · 9 comments

Comments

@vkgnosis
Contributor

Currently the driver implicitly creates batches by fetching the auction from the api at some interval. The auction in the api is updated on demand. We want to move the creation of batches (canonical auctions with ids) into the api so that we can eventually have the whole solution competition there too (#127).

The canonical auction id and the competition are both operations that don't make sense to run in multiple pods like we would with our current auto scaling configuration. We need to figure out how we want this to work. My solution, which I already discussed a bit with Nic, is that we would have some routes that are autoscalable and some that aren't. The only route we have now that wouldn't auto scale is the auction route, and in the future some routes related to the solution competition like "get current competition" and "get winner".

How should this work on a kubernetes level? As a temporary solution we can set the max replicas to 1. I think this is fine because even a single pod scales quite well since most of our work consists of forwarding requests to other servers. Long term it would be nice to keep the scaling for the other routes.

We can probably achieve this with a kubernetes / nginx config that picks the target deployment based on the route, so that get_price_estimate goes to a scaling deployment while get_auction goes to a non-scaling deployment of the same api container that we currently use. All the pods would technically be running all apis but external requests would only go to one of the deployments based on the path.
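A rough sketch of what such path based routing could look like as a Kubernetes Ingress. The service names, path and port are made up for illustration, not our actual config:

```yaml
# Hypothetical path based routing: the auction route goes to a single
# replica deployment, everything else goes to the auto scaling deployment.
# Both services would point at the same api container image.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  rules:
    - http:
        paths:
          - path: /api/v1/auction
            pathType: Prefix
            backend:
              service:
                name: api-single   # Deployment with replicas: 1
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-scaling  # Deployment with an HPA attached
                port:
                  number: 80
```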

In addition (alternatively?) we could create different containers or command line switches that configure an api pod with the operations it should handle (scalable, non scalable, both). I don't think there is a need for this now but it might be useful in the future to disable something like the auto updating native price cache if it is only needed for some parts of the api.
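For illustration, such a switch could be a single CLI flag on the api binary. A minimal sketch using clap's derive API; the flag name and variants are made up, not existing code:

```rust
use clap::{Parser, ValueEnum};

/// Hypothetical switch for which groups of routes a pod serves.
#[derive(Clone, Copy, Debug, ValueEnum)]
enum ServeMode {
    /// Only the auto scalable routes (price estimation, order placement, ...).
    Scalable,
    /// Only the single pod routes (auction, solution competition, ...).
    NonScalable,
    /// Everything, e.g. for running locally.
    Both,
}

#[derive(Debug, Parser)]
struct Arguments {
    /// Which operations this pod should handle.
    #[arg(long, value_enum, default_value = "both")]
    serve_mode: ServeMode,
}

fn main() {
    let args = Arguments::parse();
    // Only start the background tasks (like the auto updating native price
    // cache) that the selected routes actually need.
    match args.serve_mode {
        ServeMode::Scalable => { /* mount scalable routes only */ }
        ServeMode::NonScalable => { /* mount auction/competition routes only */ }
        ServeMode::Both => { /* mount everything */ }
    }
}
```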

@vkgnosis vkgnosis changed the title Create canonical auction ids in ap Create canonical auction ids in api / How to handle pod auto scaling? Apr 28, 2022
@vkgnosis
Contributor Author

Another solution could be to have all externally reachable endpoints be run by auto scaling pods while an internal worker pod performs the inherently unparallelizable tasks. For example the worker would update the auction id and store the current active auction in the database, from which the externally reachable pods read it. Likewise it would store solution competition information in the database so it can be read by the externally reachable pods.
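A minimal sketch of that split, assuming sqlx with Postgres (and the json feature); the table, columns and function names are hypothetical, the real schema would differ:

```rust
use sqlx::PgPool;

// Hypothetical payload; in practice this would be the real auction model.
type AuctionJson = serde_json::Value;

/// Worker pod (single replica, not externally reachable): creates the
/// canonical auction id by inserting the current auction into the database.
async fn store_current_auction(pool: &PgPool, auction: AuctionJson) -> sqlx::Result<i64> {
    sqlx::query_scalar("INSERT INTO auctions (json) VALUES ($1) RETURNING id")
        .bind(auction)
        .fetch_one(pool)
        .await
}

/// Api pod (auto scaling): serves whatever the worker last wrote, read only.
async fn current_auction(pool: &PgPool) -> sqlx::Result<Option<(i64, AuctionJson)>> {
    sqlx::query_as("SELECT id, json FROM auctions ORDER BY id DESC LIMIT 1")
        .fetch_optional(pool)
        .await
}
```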

This has the advantage that no matter how bad a DDoS is we will still be able to run these tasks because they are separate from all external requests.

Thinking more about it, what I'm describing is basically the driver but with DB access. Previously we thought of the driver as a thing that anyone could run themselves (coming from gp-v1), but here it has evolved into more of a backend component as it is responsible for querying the external solvers and submitting the transactions. (This nicely fits with one idea we had to implement the visible solution competition by having the driver upload its current competition state to the api, by making the "uploading" shared DB access.)

I would call this new type of driver api-driver. To easily run the system locally the api binary can run the api-driver too (think of it as a Rust async fn) but on kubernetes they would be separate deployments where the api scales and the api-driver is always a single pod (that is not externally reachable).
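To illustrate the "async fn" part, a sketch of how the local setup could compose the two; run_api and run_api_driver are placeholders, not existing functions:

```rust
// Placeholder entry points standing in for the real api and api-driver.
async fn run_api() { /* serve the HTTP routes; scales horizontally on kubernetes */ }
async fn run_api_driver() { /* create auctions, drive the competition; always a single pod */ }

#[tokio::main]
async fn main() {
    // Locally: run both in one process.
    // On kubernetes: two deployments, each running only one of these, with
    // the api-driver pinned to one pod that is not externally reachable.
    tokio::join!(run_api(), run_api_driver());
}
```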

(This is a very promising idea to me and was a big realization just now, so I hope I'm explaining it right.)

@nlordell
Contributor

nlordell commented Apr 28, 2022

The solution proposed in the comment above is a neat idea. So we could decouple the backend "driver" services needed for running auctions, the settlement competition and database maintenance from pods that use the database only for adding orders and reading & serving data.

This would mean that we would be able to continue operating for users that have already submitted orders (although, under DDoS, it might still be possible to prevent users from placing new orders by targeting the price estimation and order placement endpoints the FE requires).

Personally, I am in favour of this change.

@nlordell
Contributor

One question about the solution proposed in the comment above - how do you envision external solvers that run off-premise participating? Would they query an api-driver pod for the current auction and push new settlements to some api-driver endpoint that adds them to the database, or would the driver pod poll them as we do today?

@vkgnosis
Contributor Author

I'm not sure. With the current model the api-driver would be the one polling them, so that keeps working as is.
If we wanted it to be push based it could probably still be done through the database: the api pods store the solutions in it and the api-driver retrieves them from there. I would stay with the current approach for a while though, to avoid making it more complicated too quickly.

@vkgnosis
Contributor Author

One problem I'm thinking about now is how DB migrations will work. Currently we have this init container set up and there is only one deployment that uses the DB. If there are multiple deployments we could give each of them the init container, but if one runs the migrations first it could break the other deployment. It is not a huge problem as all the containers should be auto deployed at roughly the same time but it still doesn't feel nice.

@MartinquaXD
Contributor

One problem I'm thinking about now is how DB migrations will work...

I guess one solution could be to only give the "internal worker pod" (as you called it) write access to the DB and make everybody else push their updates through its API?
That would put the burden of authentication on the worker pod, so depending on how difficult it is to do that correctly/securely this might not be desirable.

@vkgnosis
Contributor Author

If everything had to go through that pod it would defeat the point of scaling.
On the other hand I think we already run into this DB problem today: the old api pods stay up until the new one is healthy, so it can already happen that the migrations run and prevent the old pods from working.

@MartinquaXD
Contributor

If everything had to go through that pod it would defeat the point of scaling.

How many writes to the database do you expect from the new component? I thought it would be mostly read heavy.
But I guess it would also be kind of weird to give out read access to the DB and only force writes to go through some other pods because of DB migration issues.
I'm not sure how difficult it would be to DDoS the DB with writes compared to DDoS-ing a pod that only forwards the writes to the DB.
I think I have to read up more closely on your idea.

@vkgnosis
Contributor Author

vkgnosis commented Sep 6, 2022

Has been implemented.

@vkgnosis vkgnosis closed this as completed Sep 6, 2022