Create canonical auction ids in api / How to handle pod auto scaling? #175

Closed
vkgnosis opened this issue Apr 28, 2022 · 9 comments

Comments

@vkgnosis
Contributor

Currently the driver implicitly creates batches by fetching the auction from the api at some interval. The auction in the api is updated on demand. We want to move the creation of batches (canonical auctions with ids) into the api so that we can eventually have the whole solution competition there too (#127).

The canonical auction id and the competition are both operations that don't make sense to run in multiple pods like we would with our current auto scaling configuration. We need to figure out how we want this to work. My solution, which I already discussed a bit with Nic, is that we would have some routes that are autoscalable and some that aren't. The only route we have now that wouldn't auto scale is the auction route, and in the future some routes related to the solution competition like "get current competition" and "get winner".

How should this work on a kubernetes level? As a temporary solution we can set the max replicas to 1. I think this is fine because even a single pod scales quite well since most of our work consists of forwarding requests to other servers. Long term it would be nice to keep the scaling for the other routes.

We can probably achieve this with a kubernetes / nginx config that picks the target deployment based on the route, so that get_price_estimate goes to a scaling deployment while get_auction goes to a non-scaling deployment of the same api container that we currently use. All the pods would technically be running all apis but external requests would only go to one of the deployments based on the path.
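A rough sketch of what such path based routing could look like as a Kubernetes Ingress. The service names, path and port are made up for illustration, not our actual config:

```yaml
# Hypothetical path based routing: the auction route goes to a single
# replica deployment, everything else goes to the auto scaling deployment.
# Both services would point at the same api container image.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  rules:
    - http:
        paths:
          - path: /api/v1/auction
            pathType: Prefix
            backend:
              service:
                name: api-single   # Deployment with replicas: 1
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-scaling  # Deployment with an HPA attached
                port:
                  number: 80
```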

In addition (alternatively?) we could create different containers or command line switches that configure an api pod with the operations it should handle (scalable, non scalable, both). I don't think there is a need for this now but it might be useful in the future to disable something like the auto updating native price cache if it is only needed for some parts of the api.
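For illustration, such a switch could be a single CLI flag on the api binary. A minimal sketch using clap's derive API; the flag name and variants are made up, not existing code:

```rust
use clap::{Parser, ValueEnum};

/// Hypothetical switch for which groups of routes a pod serves.
#[derive(Clone, Copy, Debug, ValueEnum)]
enum ServeMode {
    /// Only the auto scalable routes (price estimation, order placement, ...).
    Scalable,
    /// Only the single pod routes (auction, solution competition, ...).
    NonScalable,
    /// Everything, e.g. for running locally.
    Both,
}

#[derive(Debug, Parser)]
struct Arguments {
    /// Which operations this pod should handle.
    #[arg(long, value_enum, default_value = "both")]
    serve_mode: ServeMode,
}

fn main() {
    let args = Arguments::parse();
    // Only start the background tasks (like the auto updating native price
    // cache) that the selected routes actually need.
    match args.serve_mode {
        ServeMode::Scalable => { /* mount scalable routes only */ }
        ServeMode::NonScalable => { /* mount auction/competition routes only */ }
        ServeMode::Both => { /* mount everything */ }
    }
}
```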

@vkgnosis vkgnosis changed the title Create canonical auction ids in ap Create canonical auction ids in api / How to handle pod auto scaling? Apr 28, 2022
@vkgnosis
Contributor Author

Another solution could be to have all externally reachable endpoints be run by auto scaling pods while an internal worker pod performs the inherently unparallelizable tasks. For example the worker would update the auction id and store the current active auction in the database, from which the externally reachable pods read it. Likewise it would store solution competition information in the database so it can be read by the externally reachable pods.
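A minimal sketch of that split, assuming sqlx with Postgres (and the json feature); the table, columns and function names are hypothetical, the real schema would differ:

```rust
use sqlx::PgPool;

// Hypothetical payload; in practice this would be the real auction model.
type AuctionJson = serde_json::Value;

/// Worker pod (single replica, not externally reachable): creates the
/// canonical auction id by inserting the current auction into the database.
async fn store_current_auction(pool: &PgPool, auction: AuctionJson) -> sqlx::Result<i64> {
    sqlx::query_scalar("INSERT INTO auctions (json) VALUES ($1) RETURNING id")
        .bind(auction)
        .fetch_one(pool)
        .await
}

/// Api pod (auto scaling): serves whatever the worker last wrote, read only.
async fn current_auction(pool: &PgPool) -> sqlx::Result<Option<(i64, AuctionJson)>> {
    sqlx::query_as("SELECT id, json FROM auctions ORDER BY id DESC LIMIT 1")
        .fetch_optional(pool)
        .await
}
```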

This has the advantage that no matter how bad a DDoS is we will still be able to run these tasks because they are separate from all external requests.

Thinking more about it, what I'm describing is basically the driver but with DB access. Previously we thought of the driver as a thing that anyone could run themselves (coming from gp-v1), but here it has evolved into more of a backend component as it is responsible for querying the external solvers and submitting the transactions. (This nicely fits with one idea we had to implement the visible solution competition by having the driver upload its current competition state to the api, by making the "uploading" shared DB access.)

I would call this new type of driver api-driver. To easily run the system locally the api binary can run the api-driver too (think of it as a Rust async fn) but on kubernetes they would be separate deployments where the api scales and the api-driver is always a single pod (that is not externally reachable).
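To illustrate the "async fn" part, a sketch of how the local setup could compose the two; run_api and run_api_driver are placeholders, not existing functions:

```rust
// Placeholder entry points standing in for the real api and api-driver.
async fn run_api() { /* serve the HTTP routes; scales horizontally on kubernetes */ }
async fn run_api_driver() { /* create auctions, drive the competition; always a single pod */ }

#[tokio::main]
async fn main() {
    // Locally: run both in one process.
    // On kubernetes: two deployments, each running only one of these, with
    // the api-driver pinned to one pod that is not externally reachable.
    tokio::join!(run_api(), run_api_driver());
}
```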

(This is a very promising idea to me and was a big realization just now, so I hope I'm explaining it right.)

@nlordell
Contributor

nlordell commented Apr 28, 2022

The solution proposed in the comment above is a neat idea. So we could decouple the backend "driver" services needed for running auctions, the settlement competition and database maintenance from pods that use the database only for adding orders and reading & serving data.

This would mean that we would be able to continue operating for users that have already submitted orders (although, under DDoS, it might still be possible to prevent users from placing new orders by targeting the price estimation and order placement endpoints the FE requires).

Personally, I am in favour of this change.

@nlordell
Contributor

One question about the solution proposed in the comment above - how do you envision external solvers that run off-premise participating? Would they query an api-driver pod for the current auction and push new settlements to some api-driver endpoint that adds them to the database, or would the driver pod poll them as we do today?

@vkgnosis
Contributor Author

I'm not sure. With the current model the api-driver would be the one polling them, so that keeps working as is.
If we wanted it to be push based it could probably still be done through the database: the api pods store the solutions in it and the api-driver retrieves them from there. I would stay with the current approach for a while though, to avoid making it more complicated too quickly.

@vkgnosis
Contributor Author

One problem I'm thinking about now is how DB migrations will work. Currently we have this init container set up and there is only one deployment that uses the DB. If there are multiple deployments we could give each of them the init container, but if one runs the migrations first it could break the other deployment. It is not a huge problem as all the containers should be auto deployed at roughly the same time but it still doesn't feel nice.

@MartinquaXD
Contributor

One problem I'm thinking about now is how DB migrations will work...

I guess one solution could be to only give the "internal worker pod" (as you called it) write access to the DB and make everybody else push their updates through its API?
That would put the burden of authentication on the worker pod, so depending on how difficult it is to do that correctly/securely this might not be desirable.

@vkgnosis
Contributor Author

If everything had to go through that pod it would defeat the point of scaling.
On the other hand I think we already run into this DB problem today: the old api pods stay up until the new one is healthy, so it can already happen that the migrations run and prevent the old pods from working.

@MartinquaXD
Contributor

If everything had to go through that pod it would defeat the point of scaling.

How many writes to the database do you expect from the new component? I thought it would be mostly read heavy.
But I guess it would also be kind of weird to give out read access to the DB and only force writes to go through some other pods because of DB migration issues.
I'm not sure how difficult it would be to DDoS the DB with writes compared to DDoS-ing a pod that only forwards the writes to the DB.
I think I have to read up more closely on your idea.

@vkgnosis
Contributor Author

vkgnosis commented Sep 6, 2022

Has been implemented.

@vkgnosis vkgnosis closed this as completed Sep 6, 2022