Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes on HTTPScaledObjects don't update routing table #605

Closed
t0rr3sp3dr0 opened this issue Feb 18, 2023 · 10 comments · Fixed by #669
Closed

Changes on HTTPScaledObjects don't update routing table #605

t0rr3sp3dr0 opened this issue Feb 18, 2023 · 10 comments · Fixed by #669
Labels
bug Something isn't working

Comments

@t0rr3sp3dr0
Copy link
Contributor

t0rr3sp3dr0 commented Feb 18, 2023

Report

The current implementation of the HTTP Add-on allows users to modify all attributes of the HTTPScaledObject, but the internal routing table implementation also assumes some of these fields are immutable. Modifying these fields causes unexpected behaviour, the routing table is never updated but the K8s objects are. I experienced that after changing the value of .spec.targetPendingRequests and having the scaler not behave as expected.

Expected Behavior

Either .spec.scaleTargetRef and .spec.targetPendingRequests should be immutable, or we should update the routing table when these fields get updated on the K8s object.

Actual Behavior

Changes to .spec.scaleTargetRef and .spec.targetPendingRequests are successfully applied to K8s, but the internal routing table is not updated to reflect these new values.

Steps to Reproduce the Problem

  1. Install Keda and the HTTP Add-on to a brand new K8s cluster and also deploy some example HTTP application to it
  2. Check the routing table of the HTTP Add-on is empty (http://keda-add-ons-http-interceptor-admin.svc.cluster.local:9090/routing_table)
  3. Create a valid HTTPScaledObject pointing to the application you deployed and with .spec.targetPendingRequests set to 2
  4. Check the routing table was updated with the details of the HTTPScaledObject created (http://keda-add-ons-http-interceptor-admin.svc.cluster.local:9090/routing_table)
  5. Edit the HTTPScaledObject and set .spec.targetPendingRequests set to 1
  6. Check the routing table was NOT updated and TargetPendingRequests is still 2, not 1 (http://keda-add-ons-http-interceptor-admin.svc.cluster.local:9090/routing_table)

Logs from KEDA HTTP operator

2023-02-17T22:14:46Z	INFO	controllers.HTTPScaledObject	Reconciliation start	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project"}

2023-02-17T22:14:46Z	INFO	controllers.HTTPScaledObject	Reconciling HTTPScaledObject	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "Namespace": "default", "DeploymentName": "awesome-project"}

2023-02-17T22:14:46Z	INFO	controllers.HTTPScaledObject	Creating scaled objects{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "reconciler.appObjects": "addObjects", "HTTPScaledObject.name": "awesome-project", "HTTPScaledObject.namespace": "default", "external scaler host name": "keda-add-ons-http-external-scaler.keda:9090"}

2023-02-17T22:14:46Z	INFO	controllers.HTTPScaledObject	Creating App ScaledObject	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "reconciler.appObjects": "addObjects", "HTTPScaledObject.name": "awesome-project", "HTTPScaledObject.namespace": "default", "ScaledObject": {"Object":{"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","metadata":{"labels":{"app":"kedahttp-awesome-project-app","name":"awesome-project-app"},"name":"awesome-project-app","namespace":"default"},"spec":{"advanced":{"restoreToOriginalReplicaCount":true},"maxReplicaCount":100,"minReplicaCount":0,"pollingInterval":1,"scaleTargetRef":{"kind":"Deployment","name":"awesome-project"},"triggers":[{"metadata":{"host":"awesome-project.t0rr3sp3dr0.dev","scalerAddress":"keda-add-ons-http-external-scaler.keda:9090"},"type":"external-push"}]}}}}

2023-02-17T22:14:47Z	ERROR	controllers.HTTPScaledObject.addAndUpdateRoutingTable	could not add host to routing table, progressing anyway	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "reconciler.appObjects": "addObjects", "HTTPScaledObject.name": "awesome-project", "HTTPScaledObject.namespace": "default", "host": "awesome-project.t0rr3sp3dr0.dev", "error": "host awesome-project.t0rr3sp3dr0.dev is already registered in the routing table"}
github.com/kedacore/http-add-on/operator/controllers.addAndUpdateRoutingTable
	/workspace/operator/controllers/routing_table.go:46
github.com/kedacore/http-add-on/operator/controllers.createOrUpdateApplicationResources
	/workspace/operator/controllers/app.go:132
github.com/kedacore/http-add-on/operator/controllers.(*HTTPScaledObjectReconciler).Reconcile
	/workspace/operator/controllers/httpscaledobject_controller.go:123
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235

2023-02-17T22:14:47Z	INFO	controllers.HTTPScaledObject	Updating status on HTTPScaledObject	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "resource version": "1941396"}

2023-02-17T22:14:47Z	INFO	controllers.HTTPScaledObject	Updated status on HTTPScaledObject	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "resource version": "1941399"}

2023-02-17T22:14:47Z	INFO	controllers.HTTPScaledObject	Updating status on HTTPScaledObject	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "resource version": "1941399"}

2023-02-17T22:14:47Z	INFO	controllers.HTTPScaledObject	Updated status on HTTPScaledObject	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project", "resource version": "1941400"}

2023-02-17T22:14:47Z	INFO	controllers.HTTPScaledObject	Reconcile success	{"HTTPScaledObject.Namespace": "default", "HTTPScaledObject.Name": "awesome-project"}

What version of the KEDA HTTP Add-on are you running?

0.4.0

Kubernetes Version

1.24

Platform

Microsoft Azure

Anything else?

I think the current behaviour is related to our lack of support for multiple HTTPScaledObject to have the same host. This implementation prevents a new HTTPScaledObject that uses the host of an existing HTTPScaledObject to override the configuration of the latter. This would also be a security issue, as users that have access to some namespace on a cluster could after resources on other namespaces.

I took a look at the implementation details of the routing table and propose we move from having it stored in a ConfigMap to compute its state based on the HTTPScaledObject present on the cluster and have the built routing table cached in-memory, having it refreshed periodically. Another option is to keep the current ConfigMap-based implementation, but it has some drawbacks that I'll list bellow.

Proposal 1 - HTTPScaledObject-based Routing Table

Rework the routing table implementation by relying on HTTPScaledObjects. Interceptors would compute the routing table based on these objects and cache the result in-memory. The cache would be updated based on K8s events and a full reconciliation performed periodically.

Benefits
  • Have the K8s API Server as the single source of truth
  • Internal routing table state isn't stored on user-writable storage and is not affected by external factors
  • Interceptors no longer have to rely on the controller to have accurate routing table information
  • Single place to have input validations performed
Drawbacks
  • Interceptors become watchers of all HTTPScaledObjects
  • Increased number of API calls to the K8s API Server (distributed routing table computation on interceptors)

Proposal 2 - ConfigMap-based Routing Table

Fix the current routing table implementation based on ConfigMap to have it updated when HTTPScaledObjects change.

Benefits
  • Reuse existing code/architecture
  • Reduced number of API calls to the K8s API Server (centralised routing table computation on controller)
Drawbacks
  • Makes harder for users to debug if something goes wrong (users need to know the source of truth is the ConfigMap, not the HTTPScaledObject instances)
  • Need to maintain the ConfigMap structure in sync with the HTTPScaledObject definition (developers need to know they need to make changes in both places for everything to work as expected)
  • The ConfigMap is basically an aggregate view of all HTTPScaledObjects that can get out-of-sync due to business logic bugs or outage of the controller-manager
  • Additional work to make ConfigMap solution more resilient (the HTTP Add-on is currently unable to self-heal in case the ConfigMap gets deleted or corrupted)

Proposal 3 - Make fields immutable

Update the CRD to make immutable the fields the routing table rely on.

Benefits
  • Quick fix
Drawbacks
  • Degraded user experience (users would have to delete HTTPScaledObject and create a new one HTTPScaledObject to update the settings)
  • Unable to update configuration without downtime (deletion of an HTTPScaledObject make the service unreachable)
@t0rr3sp3dr0 t0rr3sp3dr0 added the bug Something isn't working label Feb 18, 2023
@t0rr3sp3dr0
Copy link
Contributor Author

@JorTurFer and @tomkerkhove, could I get some input here so I can start working on a fix for this issue?

@JorTurFer
Copy link
Member

Hi,
I think that we shouldn't store it in configMap. I mean, (IMO), the interceptors should watch changes in HTTPScaledObjects and configure themselves on any change. This makes the HTTPScaledObject the source of the truth. The problem of this is that we need to validate them to ensure that they don't modify others.
If we use the kubeclient properly, it has a cache and we won't increase a lot the amount of queries, they will increase for sure, but the approach of configuring on the fly is what we are doing in KEDA for example.

WDYT @tomkerkhove ?

@tomkerkhove
Copy link
Member

I think the current behaviour is related to our lack of support for multiple HTTPScaledObject to have the same host. This implementation prevents a new HTTPScaledObject that uses the host of an existing HTTPScaledObject to override the configuration of the latter.

I think this is good that we do not support this, similarly to core where you can only have 1 scaledObject for a scale target.

With regards to the proposed designs, I think watching the objects is only acceptable if it's done by the operator. If we'll start querying Kubernetes for it then it adds load we can avoid and flaky scenarios where the routing table is outdated because it didn't receive the message.

What is the information is kept in the operator that the interceptor queries?

@t0rr3sp3dr0
Copy link
Contributor Author

The current implementation of the interceptor doesn't fetch data directly from the controller, the controller writes data to a ConfigMap and all interceptor instances watch that ConfigMap. The information contained on the ConfigMap is a list of all registered hosts and for each host we have their service name, port number, deployment name, namespace name, and target pending requests. That's used by the interceptor proxy to route requests properly. What I would like to do is to remove this indirection and grab the data directly from HTTPScaledObjects instead of relying on this ConfigMap containing a single big JSON.

We can have a fourth option where the interceptors fetch the data from the controller, instead of using the K8s API. But I worry that in case something goes wrong with the controller, we won't be able to route traffic, causing an outage on services using the add-on. With the current implementation, when the controller is down, new changes don't take effect but the interceptors are able to use existing routes until the problem is addressed.

@t0rr3sp3dr0
Copy link
Contributor Author

Some extra information here. I was taking a look at the RBAC configuration for the add-on and noticed that both the scaler and interceptor are watching K8s resources and it's not only the routing table ConfigMap. Both watch Deployments and the scaler also watches Endpoints. Right now, all three components share the same ServiceAccount with way too much access.

We may decide to change that as well and grab all the information from the controller or rely on DNS information for the readiness data, but I don't know if that's a good idea. I believe it's okay to watch K8s resources as long as you have well-scoped read-only access.

@tomkerkhove
Copy link
Member

Some extra information here. I was taking a look at the RBAC configuration for the add-on and noticed that both the scaler and interceptor are watching K8s resources and it's not only the routing table ConfigMap. Both watch Deployments and the scaler also watches Endpoints. Right now, all three components share the same ServiceAccount with way too much access.

This should be fixed with #612 so let's keep that out-of-scope here.

I am fine with going the HTTPScaledObject route; let's do that as @JorTurFer agreed on it as well - Thanks for the proposals!

@JorTurFer
Copy link
Member

Thanks for the proposals!

+1000

let's do that as @JorTurFer agreed on it as wel

I prefer this way indeed. Configuring the interceptors based on HTTPScaledObjects we will be always up to date, not have sync issues because new interceptors will read all the HTTPScaledObject and configure themselves.

@t0rr3sp3dr0
Copy link
Contributor Author

Perfect! I'll start working on working this on Monday.

@stale
Copy link

stale bot commented Apr 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Apr 26, 2023
@t0rr3sp3dr0
Copy link
Contributor Author

Don’t close

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Apr 26, 2023
@github-project-automation github-project-automation bot moved this from To Triage to Done in Roadmap - KEDA HTTP Add-On Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants